


NAVAL POSTGRADUATE SCHOOL 
Monterey, California 



FAULT-TOLERANT APPROACH FOR DEPLOYING 
SERVER AGENT-BASED ACTIVE NETWORK MANAGEMENT (SAAM) 
SERVER IN WINDOWS NT ENVIRONMENT TO PROVIDE 
UNINTERRUPTED SERVICES TO ROUTERS IN CASE OF SERVER
FAILURE(S)




THESIS 



by 



Efraim KATI 



March 2000 



Thesis Advisor: Geoffrey Xie
Second Reader: James Bret Michael



Approved for public release; distribution is unlimited. 




REPORT DOCUMENTATION PAGE 



Form Approved 
OMB No. 0704-0188 



Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, searching existing
data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate
or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information
Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction
Project (0704-0188) Washington DC 20503.



1. AGENCY USE ONLY (Leave blank) 



2. REPORT DATE 

March 2000 



3. REPORT TYPE AND DATES COVERED 

Master’s Thesis 



4. TITLE AND SUBTITLE: FAULT-TOLERANT APPROACH FOR DEPLOYING SERVER AGENT-BASED
ACTIVE NETWORK MANAGEMENT (SAAM) SERVER IN WINDOWS NT ENVIRONMENT TO PROVIDE
UNINTERRUPTED SERVICES TO ROUTERS IN CASE OF SERVER FAILURE(S).



6. AUTHOR(S) Kati, Efraim



7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 

Naval Postgraduate School 
Monterey, CA 93943-5000 



5. FUNDING NUMBERS 



8. PERFORMING ORGANIZATION 
REPORT NUMBER 



9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES)

DARPA and NASA 



10. SPONSORING /MONITORING 
AGENCY REPORT NUMBER 

G417 



11. SUPPLEMENTARY NOTES 

The views expressed in this thesis are those of the authors and do not reflect the official policy or position of 
the Department of Defense or the U.S. Government. 



12a. DISTRIBUTION /AVAILABILITY STATEMENT 

Approved for public release; distribution is unlimited. 



12b. DISTRIBUTION CODE 

Statement A 



13. ABSTRACT (maximum 200 words) 

With the explosive growth of the Internet and the high demand for real-time network applications, the need for
integrated services networks has emerged. In response, the Next Generation Internet project was initiated, and the
Server Agent-based Active network Management (SAAM) project was started as part of it. SAAM is a server-based hierarchical
routing architecture designed to provide Quality of Service routing services for network resource intensive applications.
In the SAAM architecture, a small number of dedicated SAAM servers perform most of the network management tasks
on behalf of the routers. Because the SAAM server carries so much responsibility in the architecture, its failure
can have a devastating effect on the performance of the entire network. In order to tolerate the failure of the
SAAM server, this thesis examines fault tolerance (ft.) for the SAAM server in two phases: local area ft. and remote
area ft. For local area ft., a solution is recommended after a survey of the literature and commercial offerings.
For remote area ft., a backup server model is designed and prototyped. The prototyped model provides
robust error detection and fast recovery from the failure of the primary SAAM server.



14. SUBJECT TERMS Fault Tolerance, Heartbeat Protocol, Next Generation Internet, Networks 


15. NUMBER OF PAGES 

328 


16. PRICE CODE 


17. SECURITY CLASSIFICATION 
OF REPORT 

Unclassified 


18. SECURITY CLASSIFICATION 
OF THIS PAGE 

Unclassified 


19. SECURITY CLASSIFI- 
CATION OF ABSTRACT 

Unclassified 


20. LIMITATION OF ABSTRACT 

UL 



NSN 7540-01-280-5500 



Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. 239-18 
298-102 







THIS PAGE INTENTIONALLY LEFT BLANK 



Approved for public release; distribution is unlimited 



FAULT-TOLERANT APPROACH FOR DEPLOYING 
SERVER AGENT-BASED ACTIVE NETWORK MANAGEMENT (SAAM) SERVER IN 
WINDOWS NT ENVIRONMENT TO PROVIDE UNINTERRUPTED SERVICES TO 
ROUTERS IN CASE OF SERVER FAILURE(S)



Efraim KATI

First Lieutenant, Turkish Army 
B.S., Turkish Military Academy, 1992 



Submitted in partial fulfillment of the 
requirements for the degree of 

MASTER OF SCIENCE IN COMPUTER SCIENCE 

from the 

NAVAL POSTGRADUATE SCHOOL 
March 2000 






ABSTRACT 



Current data networks are mainly based on sophisticated stand-alone routers
that provide best-effort service. However, with the explosive growth of the Internet and
the high demand for real-time network applications, the need for integrated services
networks has emerged. For this purpose, the Next Generation Internet (NGI) project was
initiated, and the Server Agent-based Active network Management (SAAM) project was
started as part of it. SAAM is a server-based hierarchical routing architecture designed to
provide Quality of Service (QoS) routing services for network resource intensive
applications. In the SAAM architecture, a small number of dedicated SAAM servers
perform most of the network management tasks on behalf of the routers. Because the
SAAM server carries so much responsibility in the architecture, its failure can have a
devastating effect on the performance of the entire network. In order to tolerate the failure
of the SAAM server and provide uninterrupted services to routers, this thesis examines
fault tolerance for the SAAM server in two phases: local area fault tolerance and
remote area (disaster recovery) fault tolerance. For local area fault tolerance, a solution
is recommended after a survey of the literature and commercial offerings.
For remote area fault tolerance, a backup server model is designed and prototyped.
The prototyped model provides robust error detection and fast recovery from the failure
of the primary SAAM server.






TABLE OF CONTENTS 



I. INTRODUCTION
    A. BACKGROUND
    B. OVERVIEW OF SAAM
    C. PURPOSE OF THIS THESIS
    D. SCOPE OF THIS THESIS
    E. ORGANIZATION OF THIS THESIS

II. OVERVIEW OF FAULT TOLERANCE
    A. BASIC CONCEPTS AND DEFINITIONS
    B. REDUNDANCY CONCEPT
        1. Hardware Redundancy
            a. Passive Hardware Redundancy
            b. Active Hardware Redundancy
        2. Software Redundancy
            a. Consistency Checks
            b. Capability Checks
            c. N-Version Programming
            d. Recovery Blocks
        3. Information Redundancy
        4. Time Redundancy
    C. OBJECTIVES OF FAULT TOLERANCE
        1. Dependability
        2. Reliability
        3. Availability
        4. Safety
        5. Performability
        6. Maintainability
        7. Testability
    D. PHASES IN FAULT TOLERANCE
        1. Error Detection
            a. Replication Checks
            b. Timing Checks
            c. Structural Checks
            d. Reasonableness Checks
            e. Diagnostics Checks
        2. Damage Confinement and Assessment
        3. Error Correction
            a. Error Recovery
            b. Error Masking
        4. Fault Treatment and Continued Service

III. FAULT TOLERANCE IN WINDOWS NT OPERATING SYSTEM
    A. ERROR HANDLING AND PROTECTED SUBSYSTEMS
    B. NT FILE SYSTEM (NTFS)
    C. AUTOMATIC RESTART
    D. TAPE BACKUP SUPPORT
    E. UNINTERRUPTIBLE POWER SUPPLY (UPS)
    F. FAULT-TOLERANT STORAGE
        1. Stripe Set
        2. Mirror Set
        3. Stripe Set With Parity
    G. MICROSOFT CLUSTER SERVER (MSCS)
        1. Overview of Server Cluster
        2. MSCS
    H. WINDOWS NT LOAD BALANCING SERVICE (WLBS)
        1. WLBS Features
        2. WLBS Shortcomings

IV. LOCAL AREA FAULT TOLERANCE FOR SAAM SERVER
    A. PRODUCTS OVERVIEW
        1. ARCserve Replication 4.0 for Windows NT
        2. Co-StandbyServer 4.2 for Windows NT
        3. Double-Take 3.0
        4. Endurance 4000
        5. Octopus 3.2
    B. RECOMMENDATIONS

V. REMOTE AREA FAULT TOLERANCE FOR SAAM SERVER
    A. MODELING
        1. Server States
        2. Failure Detection
            a. Constant Heartbeat Protocol (V.0)
            b. Accelerated Heartbeat Protocol (V.1)
            c. Prototyping of the Heartbeat Protocols
            d. Performance Comparison of the Heartbeat Protocols
            e. Preventing False Failure Detection
            f. Preventing Late Failure Detection
            g. Existence of Two Active Servers at the Same Time
        3. Damage Confinement and Assessment
        4. Failure Recovery
            a. Size of PIB Data Records
            b. Selected Approach
        5. Fault Treatment and Continued Service
    B. INTEGRATION WITH THE EXISTING SOURCE CODE
        1. Packet Formats
        2. Integration of Error Detection Mechanism
            a. HeartbeatQuery Class
            b. HeartbeatResponse Class
            c. HeartbeatController Class
            d. BannerFrame Class
        3. Modifications Done on the Existing Source Code
            a. Message Class
            b. Server Class
            c. ServerAgent Class
            d. PacketFactory Class
    C. TESTING
        1. Testbed
        2. Tests Performed
            a. Failure Detection Test
            b. Heartbeat Response Message Loss Test
            c. Message Numbering Scheme Test
            d. Unsolicited Heartbeat Test
            e. Control Channel Auto-configuration Test

VI. CONCLUSIONS
    A. SYNOPSIS AND CONCLUSION
    B. FUTURE WORK
        1. Testing of Recommended COTS-Based Product
        2. Detection of Backup SAAM Server Failures
        3. Reinstating of a Repaired Server
        4. Handling of Two Simultaneously Active Servers
        5. Field Test
        6. Alert Mechanism

APPENDIX A. THE CONSTANT HEARTBEAT PROTOCOL SOURCE FILES
APPENDIX B. THE ACCELERATED HEARTBEAT PROTOCOL SOURCE FILES
APPENDIX C. NEWLY ADDED SOURCE FILES FOR INTEGRATION
APPENDIX D. MODIFIED SOURCE FILES FOR INTEGRATION

LIST OF REFERENCES

INITIAL DISTRIBUTION LIST






ACKNOWLEDGEMENT/DEDICATIONS 



The author would like to acknowledge the Defense Advanced Research Projects
Agency (DARPA) and the National Aeronautics and Space Administration (NASA) for
sponsoring this thesis research.

I would like to extend my sincere gratitude to my thesis advisors, Professor
Geoffrey Xie and Professor James Bret Michael, for their patience. I would like to thank
my wife, Ozlem Kati, for her love and support during the entire thesis process. I would
also like to thank my parents, Arife and Sakir Kati, without whose love and sacrifice I
would not have the opportunities that I am enjoying today. And finally, I would like to
dedicate this thesis to my son, Efekan Ozgur Kati, for his unconditional love.






I. INTRODUCTION



A. BACKGROUND 

With the explosive growth of the Internet and the increasing demand for real-time
network applications, the need for an integrated services network has emerged. An
integrated services network supports all types of data using a single network and must 
meet two requirements. First, an integrated services network must guarantee Quality of 
Service (QoS) to individual user sessions. Second, an integrated services network must 
support real-time applications that have very stringent packet delay requirements. [Ref. 
35] However, the current data networks are mainly based on sophisticated stand-alone 
routers that provide best-effort service. Therefore, current data networks are not capable 
of supporting integrated services. To solve this problem, the Next Generation Internet 
(NGI) initiative, an advanced research program, was initiated. Specifically, the NGI 
initiative fosters partnerships among academia, industry, and Federal laboratories to 
develop and experiment with technologies that will enable more powerful and versatile 
information networks of the 21st century. 

One proposal developed under the NGI initiative is the SAAM project. [Ref. 35] 
SAAM stands for Server Agent-based Active network Management. The SAAM project 
is currently sponsored by the Defense Advanced Research Projects Agency (DARPA) 
and the National Aeronautics and Space Administration (NASA), and is an ongoing 
project. More information on the SAAM project can be obtained from the project’s
World Wide Web home page (www.saamnet.org).






B. OVERVIEW OF SAAM 



SAAM is a network management system that enables a network to provide 
integrated services. Instead of a totally router-based architecture, SAAM utilizes a server- 
based hierarchical routing architecture that provides Quality of Service (QoS) routing 
services for network resource intensive applications. 

Compared to the current shortest path algorithms, QoS-based routing algorithms 
must deal with more constraints. Therefore, QoS-based routing requires more processing 
power at each router. Due to this processing power requirement, when the QoS-based 
routing is implemented, current sophisticated stand-alone routers can easily become 
performance bottlenecks. However, SAAM relieves individual routers from most routing 
and management tasks by employing a small number of dedicated SAAM servers to 
perform these tasks on behalf of the routers. Such a lightweight router approach 
implemented by SAAM increases the performance of the routers to support QoS-based 
routing for real-time applications. 

The SAAM server maintains an accurate picture of the QoS capabilities of the 
network by periodically retrieving link performance information from the routers, and 
aggregating this information into a ready-to-use database of useful paths. This database is 
called the Path Information Base (PIB) (shown in Figure 1.1). By using the PIB, the
SAAM server can efficiently implement network functions such as QoS routing and re- 
routing of real-time flows, which are required for providing integrated services. 



[Figure 1.1 labels: Path Information Base (PIB) with path-id and path-parameters; Flow Routing Table with flow-id and next hop; Datagram Routing Table with destination and next hop; Management Functions (e.g., QoS routing).]



Figure 1.1. Logical Model of SAAM. 



To make its service scalable for large networks, SAAM organizes its SAAM
servers in a hierarchy, as shown in Figure 1.2. At the first level of the hierarchy, SAAM
partitions the network into autonomous regions, called SAAM regions, and assigns one
SAAM server to each region. Each first-level SAAM server is responsible for collecting
link performance information from the routers in its own region and summarizing this
information for a higher-level server. At the second level of the hierarchy, the child
servers periodically send the summarized path performance information to the parent
server. In each SAAM region, a subset of routers, called border gateways, is responsible
for traffic into and out of the region. The parent server determines the routing across
regions.



[Figure 1.2 legend: router, SAAM server, data path, example SAAM region, legacy networks.]



Figure 1.2. Hierarchical Organization of SAAM Servers. [From Ref. 36] 



The Internet currently consists of many independently operated Internet Service 
Providers (ISPs). In Figure 1.2, the regions are considered as ISPs. SAAM is completely 
compatible with the legacy networks and supports existing inter-domain routing 
protocols. The border gateway routers make the necessary translation between the 
different domain protocols. Therefore, whether or not an ISP is using SAAM is 
transparent to other ISPs. A non-SAAM ISP can still send traffic through a SAAM ISP. 
Therefore, SAAM can be deployed incrementally, providing improvements of network 
performance to ISPs that use SAAM. The ISP that uses a SAAM server has total control 
over the operation of its internal SAAM server. In this case, the parent server only 
provides performance-enhancing advice to the internal servers. The internal server will
verify the advice based on local policies before updating the states of its routers.




C. PURPOSE OF THIS THESIS



A SAAM server is responsible for performing most of the routing and network
management tasks on behalf of the routers in its region. Therefore, the quality of the
integrated services provided by the network region depends entirely upon the
performance of the SAAM server. Unless the architecture is carefully designed, a failure
of the SAAM server can have a devastating effect on the performance of the entire network.

The main purpose of this thesis is to add fault tolerance features to the SAAM 
architecture in order to tolerate server failures. Consequently, a failure of the SAAM 
server will not interrupt SAAM services to the routers. Fault tolerance features to be 
added should address the following fault tolerance related requirements of the SAAM 
server: 

• No single point of failure should be allowed.

• Detection, isolation, and recovery of failures should happen within seconds
(preferably under two seconds).

• Environmental faults such as flood, fire, and earthquake should be
addressed.

• A failed SAAM server should be repairable while the system is in operation.

• Reinstating the repaired SAAM server should not affect the SAAM
services provided to routers.






D. SCOPE OF THIS THESIS 



In order to provide a fault tolerance solution for the SAAM server that best meets
the aforementioned requirements, fault tolerance for the SAAM servers is implemented in
two phases: local and remote. The first phase, local area fault tolerance for the SAAM
server, focuses mainly on tolerating component failures of one server, such as processor
failure, disk failure, and network interface card failure. In the second phase, remote area
fault tolerance (disaster recovery) for the SAAM server, backup servers are used to
tolerate environmental faults such as fire, earthquake, and flood that cause unrecoverable
server failures. The function of the second phase is to tolerate the failure of the local area
fault tolerance implementation of the SAAM server.

A Commercial-Off-The-Shelf (COTS) based solution is proposed for local area 
fault tolerance after a survey of the literature and commercial offerings. For remote area 
fault tolerance, a backup server model is designed and implemented. Additionally, the 
implemented model is tested in a live SAAM testbed. 

E. ORGANIZATION OF THIS THESIS 

The remainder of this thesis is organized into the following chapters: 

• Chapter II: Overview of Fault Tolerance. Provides an overview of fault
tolerance. Also explains basic terminology and the principles of fault
tolerance.

• Chapter III: Fault Tolerance in Windows NT Operating System. Explores
the fault tolerance related features of the Windows NT operating system.

• Chapter IV: Local Area Fault Tolerance for SAAM Server. Focuses
mainly on tolerating component failures of one server. Also examines the
five most promising third-party products for the Windows NT operating
system and proposes one of these products as a recommended solution.

• Chapter V: Remote Area Fault Tolerance for SAAM Server. Focuses on
tolerating environmental faults such as flood, fire, and earthquake. A
backup server model is designed and prototyped. Also explains the
integration of the prototype with the existing SAAM server source code.

• Chapter VI: Conclusions. Summarizes the test results of the implemented
backup server model and outlines the work that remains to be done in the
SAAM project.






II. OVERVIEW OF FAULT TOLERANCE 



A. BASIC CONCEPTS AND DEFINITIONS 

First, the definitions of system, error, fault, and failure are presented. These
terms are used in a variety of ways in different contexts. Although the terms fault,
failure, and error are often used interchangeably, they have distinct meanings in fault
tolerance terminology. The definitions presented in this section clarify the
distinctions among these terms. Throughout this thesis, the terms fault,
failure, and error are used in accordance with the following definitions.

The concept of a system is quite general and it is used in other disciplines. In 
general, a system is defined as an identifiable mechanism that maintains a pattern of 
behavior at an interface between the system and its environment [Ref. 1]. In computer 
systems, the term interface represents identifiable hardware or physical entities. Although 
a system is considered as a single module, in fact systems are composed of a number of 
subsystems. Therefore, terminology used for the system under consideration also applies 
recursively to its subsystems. 

Each system has an ideal specified behavior and an observed actual behavior. A 
failure is a deviation of the actual behavior from the specified behavior [Ref. 2]. An error 
is that part of the system state which is liable to lead to a failure [Ref. 3]. A fault is a 
physical defect, imperfection, or flaw that occurs within some hardware or software 
component [Ref. 4]. Although a fault has the potential of generating errors, it may not
cause any errors during the period of observation. On the other hand, the existence of an
error always indicates that the system has a faulty part.

Faults can be classified using two different key attributes, duration and cause. 
The duration specifies the length of the time that a fault is active. When duration is 
considered, faults are classified as transient, intermittent or permanent. Transient faults 
appear and disappear within a very short period of time. Intermittent faults repeatedly 
appear, but always for a short duration. Permanent faults remain in existence indefinitely 
if no corrective action is taken. 

According to their causes, faults are classified as design or operational faults. 
Design faults appear during the system design or modification phases. Operational faults 
appear during the system lifetime and they are caused by physical reasons such as 
electromagnetic interference, battle damage, operator mistakes, and environmental 
extremes. 

If the system behaves as specified, it is said to be in the service
accomplishment state. However, if the system behaves differently from its specification,
it enters the service interruption state. Usually a system’s
actual behavior matches its specified behavior. As shown in Figure 2.1, a fault
occasionally creates an error that causes the system to fail.

When a system fails, it enters a service interruption state. After the error is
detected, reported, and corrected, the system returns to a service accomplishment state.
The time between the occurrence of a fault and the appearance of an error is called fault
latency. The time between the occurrence of an error and the appearance of the resulting
failure is called error latency. The total time between the occurrence of a fault
and the appearance of the resulting failure is the sum of the fault latency and the error
latency.







Figure 2.1. Service States of a System. [After Ref. 2] 

In order to improve or maintain the system’s normal performance (i.e., to keep the
system in the service accomplishment state), three techniques are used: fault avoidance,
fault masking, and fault tolerance. The fault avoidance technique is used for preventing
the occurrence of faults. Fault avoidance can include quality control measures
implemented before the system becomes operational, such as design reviews, component
screening, and testing. The fault masking technique is used for preventing faults in a
system from introducing errors. The fault tolerance technique is used for preventing
system failures from occurring, even if errors caused by faults appear in the system. Since
failures are directly caused by errors, the terms fault tolerance and error tolerance are
often used interchangeably.

A system is considered to be fault-tolerant if the actual behavior of the system 
stays consistent with its specifications, despite the failures of its sub-systems. It is not 
possible to make a system fault tolerant against its own failures. If the system fails due to 
an error, then there is nothing that can be done in terms of fault tolerance. However, a 
system can be designed to be fault tolerant against the failure of its sub-systems. 
Consequently, the ultimate goal of fault tolerance is to prevent a system failure when 
some of its sub-systems fail. 



B. REDUNDANCY CONCEPT 

Redundancy is defined as those parts of the system that are not needed for the
correct functioning of the system if no fault tolerance is to be supported [Ref. 1].
Redundancy is the guiding principle behind fault tolerance. In order to build a
fault-tolerant system, some redundant sub-systems must exist in the system to be used instead
of a failed sub-system. Therefore, redundancy is essential for fault tolerance. On the other
hand, redundancy can introduce some side effects into the system. These side effects can
take the form of performance degradation, an increase in the size and weight of the
system, or reduced reliability*.

* Probability of hardware failure, not of the entire system.






Four types of redundancy could be implemented. They are hardware redundancy, 
software redundancy, time redundancy and information redundancy. The following 
sections will discuss the four redundancy techniques in detail. 

1. Hardware Redundancy 

Hardware redundancy refers to the replication of the hardware components of the
system. As hardware has become smaller and less expensive, hardware redundancy
has become more practical. Hardware redundancy can be
implemented using one of two techniques: passive hardware redundancy and active
hardware redundancy.

a. Passive Hardware Redundancy 

Passive hardware redundancy can be used to mask the occurrence of faults
and to prevent the faults from causing the system to fail. This approach does not require
any error detection or system reconfiguration. Implementations of passive hardware
redundancy rely upon a voting mechanism among the replicated
hardware components, and inherently tolerate the faults.

A simple passive hardware redundancy design can be implemented using
three replicated hardware units and a voter, as shown in Figure 2.2. This type of passive
redundancy is called Triple Modular Redundancy (TMR). In triple modular redundancy,
the voter compares the outputs of the three modules and passes the majority value
through. If one of the three modules becomes faulty, the remaining two modules
can mask the faulty module.
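
The voting step described above can be sketched in a few lines. This is only an illustration, not code from the thesis: the function name `tmr_vote` is hypothetical, and the sketch assumes the module outputs are integers so that the classic bit-by-bit majority function used by hardware voters can be applied.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three module outputs (hypothetical helper).

    Each output bit is 1 exactly when at least two of the three
    corresponding input bits are 1, which is the usual logic of a
    hardware majority voter.
    """
    return (a & b) | (a & c) | (b & c)


# Two healthy modules agree on 0b1010; the third module is faulty.
healthy = 0b1010
faulty = 0b0110
assert tmr_vote(healthy, healthy, faulty) == healthy
```

Because the voter works on each bit independently, a single faulty module is outvoted no matter which bits it corrupts.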







Figure 2.2. Basic Triple Modular Redundancy Design. 



The major problem with TMR is the fact that the voter is a single point
of failure. A failure of the voter results in the failure of the system. Thus, the
reliability of the system is directly proportional to the reliability of the voter. One
approach to overcome this problem is to triplex the voters as well as the modules and to
provide three independent outputs. A sample design for the voter triplexing approach in
TMR is shown in Figure 2.3.




Figure 2.3. Triplexing of the Voters in a TMR Design. 



A generalization of the TMR approach is called N-Modular Redundancy
(NMR). In the NMR technique, N represents the number of replicated modules in the
design and it is usually an odd number greater than three so that majority voting can be
used. Although TMR can mask only one faulty module, NMR can mask the effect of
more than one faulty module. For example, in a 5-MR design, it is possible to mask two
faulty modules. It can be shown that the number of faulty modules that can be masked
using the NMR approach equals (N - 1) / 2.
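
As a rough illustration of the (N - 1) / 2 rule, the sketch below computes the masking capacity for an odd N and performs a strict majority vote over N outputs. The names `max_masked_faults` and `nmr_vote` are hypothetical, and the sketch assumes that module outputs can be compared for equality and hashed.

```python
from collections import Counter


def max_masked_faults(n: int) -> int:
    """Number of faulty modules an N-modular design can mask."""
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be an odd number of at least 3")
    return (n - 1) // 2


def nmr_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise ValueError("no strict majority among module outputs")
    return value


# A 5-MR design masks two faulty modules (outputs 0 and 1 here).
assert max_masked_faults(5) == 2
assert nmr_vote([9, 9, 9, 0, 1]) == 9
```

With three faulty modules out of five, no value reaches a strict majority and the vote fails, which matches the limit given by the formula.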

The simple and powerful passive hardware redundancy technique seems to 
have solved the hardware fault-tolerance problem. It can mask almost all physical device 
failures. However, it does not mask failures caused by hardware design flaws. If all the 
modules have faulty designs, then the comparators or voters, no matter how many of 
them, will not help. 

Passive hardware redundancy itself does not improve availability or 
reliability. In fact, adding redundancy reduces reliability in some designs as explained 
above. This parallels the airplane analogy: A two-engine airplane costs twice as much 
and has twice as many engine problems as a one-engine airplane. Redundancy designs 
require repair to dramatically improve availability. [Ref. 2] 
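
One way to see how redundancy can lower reliability is the standard textbook reliability expression for TMR, which is not taken from this thesis. The sketch assumes three identical modules that fail independently, each with reliability R, and a single voter in series with them; the name `tmr_reliability` is hypothetical. The system works when at least two modules work, which gives 3R^2 - 2R^3, multiplied by the voter reliability.

```python
def tmr_reliability(module_reliability: float,
                    voter_reliability: float = 1.0) -> float:
    """Reliability of TMR with one voter in series (independence assumed).

    P(at least two of three modules work) = R^3 + 3 R^2 (1 - R)
                                          = 3 R^2 - 2 R^3
    """
    r = module_reliability
    two_of_three = 3 * r ** 2 - 2 * r ** 3
    return voter_reliability * two_of_three


# Good modules: TMR (about 0.972) beats a single module (0.9).
assert tmr_reliability(0.9) > 0.9
# Poor modules: TMR (about 0.352) is worse than a single module (0.4).
assert tmr_reliability(0.4) < 0.4
```

Under these assumptions, TMR without repair only helps when each module is more reliable than 0.5, and an imperfect voter reduces the benefit further.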

b. Active Hardware Redundancy 

In active hardware redundancy, fault tolerance is implemented by fault 
detection, location and recovery instead of by fault masking. In active hardware 
redundancy, erroneous states are acceptable as long as the system is capable of detecting 
the errors, reconfiguring itself and regaining its operational state. 

The main states of the active redundancy approach are shown in Figure
2.4. Once a fault occurs during normal operation, an error in the system results. If the
error is not detected and corrected, the consequence will be system failure. Once the error
is detected, the faulty component causing the error has to be located and
removed from the system’s configuration. Then, the system must be reconfigured with
the redundant component in place of the failed one. Finally, the system returns to either its
normal operational state or a degraded operational state, depending on the performance of
the replacement component.




Figure 2.4. Operation of an Active Hardware Redundancy Approach. [After Ref. 4] 



In the active hardware redundancy approach, the detection of faults is of
great importance. The duplication with comparison scheme is an example of a fault
detection mechanism that can be used. Duplication with comparison refers to the
development of two identical pieces of hardware, having them perform the same
computation in parallel, and comparing the results of those computations. If the results of
the two computations do not match, a fault is detected and an error message is generated.
Duplication with comparison does not provide fault tolerance by itself; it is
mainly used as a fault detection mechanism in the active hardware redundancy approach.
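
Duplication with comparison can be sketched in software as follows. This is a minimal illustration rather than anything from the thesis: the two "units" are modeled as plain callables, and the names `ComparisonError` and `duplicate_and_compare` are hypothetical.

```python
class ComparisonError(Exception):
    """Raised when the two replicated computations disagree."""


def duplicate_and_compare(unit_a, unit_b, value):
    """Run the same computation on two units and compare the results.

    A mismatch only signals that a fault exists; it does not tell
    which of the two units is the faulty one.
    """
    result_a = unit_a(value)
    result_b = unit_b(value)
    if result_a != result_b:
        raise ComparisonError(f"mismatch: {result_a!r} != {result_b!r}")
    return result_a


def square(x):
    return x * x


# Two healthy units agree, so the result is passed on.
assert duplicate_and_compare(square, square, 4) == 16
```

The sketch shows why the scheme detects faults but does not tolerate them: after a mismatch there is no third opinion to decide which result is correct.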

Once the faulty component is detected and identified, the system should be 
reconfigured to replace the faulty component. This reconfiguration can be achieved by a 
technique called standby replacement or standby sparing. Standby replacement refers to 
replacing the faulty component with the provided spares. 

The standby replacement process introduces a momentary interruption in
the service delivered by the system. To minimize this interruption, a form of standby
replacement called hot standby sparing can be used. In hot standby sparing, a
redundant component operates in parallel with the online component, so that the
redundant component is ready to take over when needed.

There is another form of the standby replacement technique, called cold 
standby sparing. Unlike the hot standby sparing method, the spares in cold standby 
sparing remain non-operational until they are needed. The process of bringing the spare 
into an operational state increases the service interruption time. However, power 
consumption is lower for cold standby sparing than for hot standby sparing.

A variation of the standby replacement technique is called pair-and- spare 
or dual-dual. Basically, the pair-and-spare scheme uses both hot standby sparing and 
duplication with comparison techniques in its design. It combines two fail-fast modules, 
as shown in Figure 2.5, to produce a super module that continues operating, even if one 
of the sub modules fails. Fail-fast modules are designed in such a way that they either 
operate correctly or stop operating immediately. This is achieved by using the duplication
with comparison approach. When the outputs of the modules in the fail-fast module
do not match, an error signal is generated and the other module is stopped immediately.
Additionally, the two fail-fast modules operate in parallel, similar to the hot standby
sparing method. Because each sub module is fail-fast, the combination is just like the
logical “OR” of the two sub modules.




2. Software Redundancy 

Redundancy in software can be implemented in many ways — it is not necessary to 
replicate the complete software program to achieve software redundancy. Software 
redundancy can appear as several extra lines in the code for checking specific values or 
as a routine used to periodically test the specific locations in the system’s memory. The 
following sections will discuss some basic software redundancy techniques. 







a. Consistency Checks 

Consistency checks are used to verify the correctness of specific 
information in the software application. In some applications, a specific set of data
members is required to remain within certain value ranges. If the value of a data member
falls outside its predetermined range, then an error is declared. Memory address checking
mechanisms, implemented in the operating system software, can be given as an example 
of consistency checks. The memory address checking mechanism determines if the 
address access generated by the application is outside of the available address range of 
the memory. 
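
The memory address check described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the byte-addressed memory model, and the 1024-byte memory size are hypothetical and are not taken from any particular operating system.

```python
def check_address(address, length, memory_size):
    """Return True if the access [address, address + length) lies inside memory."""
    return address >= 0 and length > 0 and address + length <= memory_size


MEMORY_SIZE = 1024  # hypothetical size of the available memory, in bytes

for address, length in [(0, 16), (1020, 4), (1020, 8), (-4, 4)]:
    status = "ok" if check_address(address, length, MEMORY_SIZE) else "error declared"
    print(address, length, status)
```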



b. Capability Checks 

Capability checks are performed to verify that a system possesses the 
required capability. For example, a capability check would be useful if one would like to 
know if the entire memory is available, or if all of the processors in a multiprocessor 
system are working properly. Or, one might want to know if the ALU in the processor is 
working properly, in which case the capability check would again be used. [Ref. 4] 

c. N-Version Programming

N-version programming attempts to tolerate the design faults in software
modules. The basic principle of N-version programming is to produce the code of the
same application N times, with the same specifications and the same functionality but by
different programmers, and then to vote among the N outputs produced by these
different software versions (illustrated in Figure 2.6). The concept of
producing different versions of the same software provides different failure modes in
each version and is called design diversity.

Unfortunately, even different programmers can make the same mistake, or
a common mistake can arise from the original specification. N-version programming is
also expensive, raising the system implementation and maintenance costs by a factor of N
or more [Ref. 4].
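
The voting step of N-version programming can be sketched as below. The three versions of an absolute-value routine are hypothetical stand-ins for independently developed programs, and the strict-majority voter is one possible voting rule, assumed here for simplicity.

```python
from collections import Counter


def n_version_vote(versions, value):
    """Run every version on the same input and return the strict-majority output."""
    outputs = [version(value) for version in versions]
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        raise RuntimeError(f"no majority among outputs {outputs}")
    return winner


# Three hypothetical, independently written versions of an absolute-value routine.
def version_a(x):
    return x if x >= 0 else -x


def version_b(x):
    return abs(x)


def version_c(x):
    return x  # design fault: negative inputs are returned unchanged


print(n_version_vote([version_a, version_b, version_c], -4))
```

With three versions, the voter masks the design fault in any single version; if two versions shared the same mistake, the voter would accept the wrong output, which is the weakness noted above.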



Figure 2.6. The N-Version Programming Concept. [From Ref. 4]



d. Recovery Blocks 

The recovery block approach is very similar to the cold standby sparing 
approach that is used in hardware redundancy. The concept is illustrated in Figure 2.7. 
One of the N versions of a program, called the primary version, is always used, unless it fails
to pass the acceptance tests. The acceptance tests are, essentially, checks performed on
the results produced by the program and may be created using consistency checks and
capability checks. If the primary version fails to pass the acceptance tests, then the first
secondary version is tried. This process continues until one version passes the acceptance
tests. When all versions fail to pass the acceptance tests, a system failure occurs.
Assuming perfect coverage and independent faults, the recovery block approach can
tolerate up to N - 1 faults. [Ref. 4]
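
A minimal sketch of the recovery block control flow follows. The square-root versions and the acceptance test are hypothetical examples; treating an exception as a failed acceptance test is an assumption of this sketch, not something the source specifies.

```python
def recovery_block(versions, acceptance_test, value):
    """Try the primary version first, then each secondary, until one is accepted."""
    for version in versions:
        try:
            result = version(value)
        except Exception:
            continue  # a crash is treated the same as a failed acceptance test
        if acceptance_test(value, result):
            return result
    raise RuntimeError("all versions failed the acceptance test")


def primary_sqrt(x):
    return x / 2.0  # hypothetical faulty primary version


def secondary_sqrt(x):
    return x ** 0.5


def acceptable(x, result):
    # Acceptance test: the result squared must reproduce the input.
    return abs(result * result - x) < 1e-9


print(recovery_block([primary_sqrt, secondary_sqrt], acceptable, 9.0))
```

Unlike the voter in N-version programming, only one version runs at a time, which mirrors the cold standby analogy made in the text.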




Figure 2.7. The Recovery Block Approach. [From Ref. 4] 



3. Information Redundancy 

Information redundancy refers to the addition of redundant information to the data 
with the objective of providing fault tolerance. The purpose of the information 
redundancy is to protect the state of the information or to protect the transport of 
messages. The basic idea behind adding extra bits is so that errors in some bits can be 
detected, and if possible, corrected. The process of adding check bits is called encoding. 
The reverse process of extracting the information from the encoded information is called 
decoding. Error detection and error correction codes (e.g., the Hamming Code) are 
examples of the information redundancy technique. 
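
As an illustration of encoding and decoding, the sketch below implements the standard Hamming(7,4) code, which adds three check bits to four data bits and can correct any single-bit error. The function names and the bit layout (check bits at positions 1, 2, and 4) are choices made for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits as 7 bits laid out as p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit; return (data bits, error position or 0)."""
    b = list(code)
    s1 = b[0] ^ b[2] ^ b[4] ^ b[6]
    s2 = b[1] ^ b[2] ^ b[5] ^ b[6]
    s4 = b[3] ^ b[4] ^ b[5] ^ b[6]
    error_position = s1 + 2 * s2 + 4 * s4  # 1-based position of the flipped bit
    if error_position:
        b[error_position - 1] ^= 1
    return [b[2], b[4], b[5], b[6]], error_position


codeword = hamming74_encode([1, 0, 1, 1])
corrupted = list(codeword)
corrupted[4] ^= 1  # simulate a single-bit error at position 5
print(hamming74_decode(corrupted))
```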






4. Time Redundancy



Time redundancy refers to the repetition of a computation or communication 
action in the domain of time. The purpose of time redundancy is to detect and possibly 
tolerate the occurrence of transient faults. 

The repetition of computation can be used either to compare the results of 
different computations to determine if a discrepancy exists, or to determine if the existing 
discrepancy has been corrected or not. This approach is effective when the fault causing 
the erroneous state is transient. In order to repeat the computation correctly each time, it 
is essential that the same data is provided to the system. However, when the system 
enters into an erroneous state, data used in the computation may be corrupted. If this 
happens, then it becomes impossible to repeat the computation. 

The repetition of the communication action can be used to tolerate transient faults 
resulting in errors in the messages transmitted among the system components. If the 
message is corrupted or lost due to a transient fault, then repeating the message 
transmission most likely will not introduce the same error again. 
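
Repetition of a communication action can be sketched as a bounded retry loop. The channel below is a hypothetical stand-in that fails a fixed number of times before succeeding; the retry limit of three attempts is likewise an assumption made for the example.

```python
def send_with_retries(send, message, max_attempts=3):
    """Repeat a transmission until it succeeds or the attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(message), attempt
        except ConnectionError:
            continue  # assume the fault was transient and try again
    raise ConnectionError(f"message not delivered after {max_attempts} attempts")


def make_flaky_channel(failures):
    """Hypothetical channel that loses the first `failures` transmissions."""
    state = {"remaining": failures}

    def send(message):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionError("transient fault")
        return f"ack:{message}"

    return send


print(send_with_retries(make_flaky_channel(2), "hello"))
```

If the fault is permanent rather than transient, the loop exhausts its attempts and reports the failure, which is the limit of what time redundancy alone can tolerate.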

C. OBJECTIVES OF FAULT TOLERANCE 

Fault tolerance is an attribute that is designed into a system to achieve some 
design requirements. The significant requirements are dependability, reliability, 
availability, safety, performability, maintainability, and testability. Fault tolerance is one 
system attribute capable of fulfilling such requirements. [Ref. 4] 






1. Dependability



Dependability is defined as the quality of the delivered service such that reliance 
can justifiably be placed on this service [Ref. 3]. Dependability covers all the measures 
used for quantifying the quality of the delivered service such as reliability, availability, 
safety, maintainability, and testability. 

2. Reliability 

The reliability of a system is a function of time, R(t), defined as the conditional
probability that the system performs correctly throughout the interval of time [t_0, t],
given that the system was performing correctly at time t_0 [Ref. 4]. In other words,
reliability is a measure of the continuous service accomplishment from a reference initial 
instant. If the lifetime of a system is exponentially distributed, then the reliability of that 
system is: 

R(t) = e^{-\lambda t}    (2.1)

The parameter λ is called the failure rate of the system, and is defined as the
expected number of failures of a system per unit of time. The commonly accepted
relationship between the failure rate and time for electronic components is called the
bathtub curve, and is illustrated in Figure 2.8. The decreasing section of the bathtub curve is
called the infant mortality phase. The increasing section of the bathtub curve is called the
wear-out phase. In this region, failures begin to appear and increase rapidly due to the
physical wearing of electronic components. The intermediate phase is called the useful
life phase of the component. During this phase the failure rate is assumed to be constant,
which is the λ value explained above.




Figure 2.8. Bathtub Curve Relationship Between Failure Rate and Time. [From Ref. 4] 

The exponential relationship between the reliability and the time is known as the 
exponential failure law. The exponential failure law is extremely valuable for the analysis 
of electronic components, and is by far the most commonly used relationship between 
reliability and time. [Ref. 4] 

When the exponential failure law is applied to a system, the life of the system is 
assumed to be exponentially distributed. With this assumption, the Mean Time to Failure 
(MTTF) (or expected life) of the system can be calculated with the following equation:

MTTF = \frac{1}{\lambda}    (2.2)
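
Equations 2.1 and 2.2 can be evaluated numerically as in the sketch below. The failure rate of 0.0001 failures per hour is a hypothetical value chosen only to make the arithmetic easy to follow.

```python
import math


def reliability(failure_rate, t):
    """Exponential failure law: R(t) = exp(-failure_rate * t)."""
    return math.exp(-failure_rate * t)


def mttf(failure_rate):
    """Mean time to failure for a constant failure rate: 1 / failure_rate."""
    return 1.0 / failure_rate


FAILURE_RATE = 1e-4  # hypothetical failure rate, in failures per hour

print(reliability(FAILURE_RATE, 1000.0))   # probability of surviving 1000 hours
print(mttf(FAILURE_RATE))                  # expected life, in hours
print(reliability(FAILURE_RATE, mttf(FAILURE_RATE)))
```

The last line shows a property of the exponential failure law: the probability of surviving until the MTTF is e^{-1}, or roughly 37 percent, regardless of the failure rate.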






3. Availability 



Availability is a function of time, A(t), defined as the probability that a system is
operating correctly and is available to perform its functions at time t. Availability is
related to, but different from, reliability. Reliability measures how frequently the system
fails, whereas availability measures the percentage of time the system is in its operational
state. When the mean time to failure of a system is represented as MTTF, and the Mean
Time to Repair of the failed system is represented as MTTR, then the availability, A, is
calculated as follows:

A = \frac{MTTF}{MTTF + MTTR}    (2.3)



System availability is frequently classified by measuring the percentage of time in
which the system is available. Table 2.1 shows these common classes and the associated
availability percentages and related annual downtime. Systems are characterized as
having a certain number of “9”s (e.g., a “five nines” system) or as being in a certain
availability class (e.g., “Class 5”) according to the band of availability they achieve. A
Class 5 system, for example, has 99.999% - 99.9999% availability.






AVAILABILITY  | ANNUAL OUTAGE              | AVAILABILITY CLASS
MEASUREMENT   |                            |
--------------+----------------------------+----------------------
90%           | More than a month          | One nine (Class 1)
99%           | Just under four days       | Two nines (Class 2)
99.9%         | Just under nine hours      | Three nines (Class 3)
99.99%        | About an hour              | Four nines (Class 4)
99.999%       | A little over five minutes | Five nines (Class 5)
99.9999%      | About half a minute        | Six nines (Class 6)
99.99999%     | About three seconds        | Seven nines

Table 2.1. Availability Classes. [From Ref. 10]
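
The annual outage figures in Table 2.1 follow directly from the availability percentage, and Equation 2.3 gives the availability from MTTF and MTTR. The sketch below shows both calculations; the 365-day year and the sample MTTF and MTTR values are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_hours(avail):
    """Expected hours per year during which the system is not available."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical system: fails on average every 999 hours and takes 1 hour to repair.
print(availability(999.0, 1.0))

# Annual outage for a "five nines" system, in minutes.
print(annual_downtime_hours(0.99999) * 60)
```

For 99.999% availability the result is a little over five minutes per year, in agreement with the corresponding row of the table.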



4. Safety 

Safety, S(t), is the probability that a system will either perform its functions 
correctly or will discontinue its functions in a manner that does not disrupt the other 
systems or compromise the safety of any people associated with the system. Safety is a 
measure of the fail-safe capability of the system; if the system does not operate correctly, 
it is desirable to have the system fail in a safe manner. [Ref. 4] 

In order to increase the safety of a system, the likelihood of undetected error in 
the output should be made negligible so that when an uncorrectable error in the output is 
detected, it is possible to carry out the safe failure of the system. 

Although the concept of safety seems similar to that of reliability, they are in fact
different. Reliability is the probability that a system will perform its functions correctly,
whereas safety is the probability that a system will either perform its functions correctly
or will discontinue the functions in a harmless manner. It may be noted that when a
system is reliable, it is also safe. However, the reverse is not always true.

5. Performability 

The performability of a system is a function of time, P(L,t), defined as the
probability that the system performance will be at or above some level, L, at the instant
of time t. Performability differs from reliability in that reliability is a measure of the
likelihood that all of the functions are performed correctly, while performability is a 
measure of the likelihood that some subset of functions is performed correctly. [Ref. 4] 

Graceful degradation, which refers to the ability of a system to automatically 
decrease its performance level, is an important system feature closely related to 
performability. Fault tolerance can support graceful degradation and performability by 
neutralizing the effects of hardware and software faults of a system, thereby allowing 
performance at some reduced level. [Ref. 4] 

6. Maintainability 

Maintainability is a measure of the ease with which a system can be repaired
once it has failed. In other words, maintainability is the probability, M(t), that a failed
system will be restored to an operational state within a period of time, t. [Ref. 4] 

Maintainability covers not only the repair of system failures, but also the
modifications that are necessary for the required level of system performance. In order to
keep a system in a state that is relevant to its users, it is mandatory to repeatedly modify
and enhance the system functions. The ease with which such modifications can be
performed is dependent on the modularization of the system. If the consequences of a
modification can be localized to well-defined small modules, then the maintenance effort
can be minimized.

Fault tolerance can support maintainability in the problem detection and problem 
location process. Once the problem is detected and located, maintenance can be
performed. Fault tolerance can also support maintainability in the modification process 
by allowing maintenance actions without interrupting the service delivered by the system. 

7. Testability 

Testability is simply the ability to test for certain specifications of the system.
In order to improve the testability, certain tests can be automated and integrated into the 
system. Fault tolerance techniques can be used to detect and locate the problems for the 
purpose of improving testability. [Ref. 4] Since a system must be retested after every 
modification or fix, testability is closely related to maintainability. 

D. PHASES IN FAULT TOLERANCE 

The implementation of fault tolerance in a particular system is dependent upon the 
system itself, and differs from one system to another. Every system requires a different 
type of implementation of fault tolerance depending on its needs, functionality, and 
architecture. Therefore it is very difficult to propose a general technique for adding fault 
tolerance to a system. However, there are some general actions that systems must 
perform during the implementation of fault tolerance. These actions can be classified
according to the phase in which they occur: error detection, damage confinement and
assessment, error correction, and fault treatment and continued service.

1. Error Detection 

Before starting any fault tolerance activity, an error must be detected. The presence of
faults and failures cannot be observed directly. Since an error is defined as a state of a
system, the presence of an error can be detected by checking the system’s states. Afterwards,
the presence of failures and faults can be deduced from the detected error in the system. 
The effectiveness of the fault tolerance implementation depends directly on the 
effectiveness of the error detection mechanism employed. 

There are some important properties that an ideal error detection mechanism 
should satisfy. First, an ideal check should be determined solely from the specifications 
of the system, and should not be influenced by the internal design of the system. Any 
influence of the system on the check can cause the same error in the check as is present in 
the system. For that reason, while designing an error detection mechanism, the system 
should be treated as a “black box”. [Ref. 1] 

Secondly, the error detection mechanism should be complete and correct. All
possible errors should be detected, and all declared errors should actually be present in
the system. In other words, the detection mechanism should neither miss errors nor raise
false alarms.

Thirdly, the check should be independent from the system with respect to
susceptibility to faults. If the detection mechanism fails when the system fails, then the
check is useless. The detection mechanism should have a different failure mode than the
system. This minimizes the probability that the detection mechanism will fail at the same
time as the system.

Implementations of real detection mechanisms rarely satisfy all of the criteria explained
above, due to their complexity, impracticality, or cost. Therefore, instead of ideal checks,
acceptable checks are used for error detection in real designs. An acceptable check is an
approximation performed by ignoring errors that rarely occur. This type of checking
design aims to lower the cost of implementation while still maximizing the number of
errors detected.

In computer systems, different types of error detection mechanisms are employed 
depending on the system and the errors to be detected. In the following sections, some 
general types of checks that are most frequently employed in computer systems will be 
discussed. 



a. Replication Checks 

In this type of check, some components of the system are replicated as
many times as needed, depending upon the application, and then the results of these
components are compared or voted on to detect errors. Since replication checks involve
replication of the system components, they are among the most expensive methods of error
detection. However, this type of check can be fairly complete and can be implemented
without the internal design information of the system being replicated.






b. Timing Checks 

Timing checks are used to verify that the time-related constraints of the
system are being met. Usually a timer is set to a value
according to the system’s specifications. Expiration of the timer indicates that the
time-related constraints of the system are violated. Timing checks are used in both hardware
and software systems.

Timing-related errors are very important, especially in distributed systems. 
In most distributed systems, a working node must respond within some pre-determined 
time to show that it is up and running. If a node fails to respond within the timeout 
period, then its failure is declared. This is the most common method of detecting node 
failures. 
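
The timeout-based detection of node failures can be sketched as follows. The node names, timestamps, and the five-second timeout are hypothetical; timestamps are passed in explicitly so that the example is deterministic.

```python
def failed_nodes(last_heartbeat, now, timeout):
    """Return the nodes whose most recent heartbeat is older than the timeout."""
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > timeout
    )


# Hypothetical time (in seconds) at which each node was last heard from.
last_heartbeat = {"router-1": 98.0, "router-2": 91.5, "router-3": 99.2}

print(failed_nodes(last_heartbeat, now=100.0, timeout=5.0))
```

A node that is merely slow is indistinguishable from a failed one under this check, so the timeout value trades detection speed against false alarms.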



c. Structural Checks 

When data is the primary concern of the fault tolerance, structural checks
are used to detect errors. In structural checks, redundant information added to the
data to be protected is used for detecting errors.

Structural checks are mostly used in hardware in a process called coding.
In the coding mechanism, some extra bits are added to the actual data bits. These coding
bits are calculated according to relationship rules based on the value of the data bits. The
structural check mechanism recalculates the coding bits and compares them with the existing
ones. When the coding bits or the data bits are corrupted, the newly calculated coding bits
will not match the old ones, and thus the error will be detected (e.g., digital signatures).
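
The recalculate-and-compare step can be illustrated in software with a CRC-32 checksum, used here as one example of coding bits (the source does not prescribe a particular code). The function names and the four-byte trailer layout are assumptions of this sketch.

```python
import zlib


def protect(data: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the data as the coding bits."""
    return data + zlib.crc32(data).to_bytes(4, "big")


def verify(frame: bytes) -> bool:
    """Recalculate the CRC-32 over the data part and compare it with the stored one."""
    data, stored = frame[:-4], frame[-4:]
    return zlib.crc32(data).to_bytes(4, "big") == stored


frame = protect(b"link state update")
print(verify(frame))

corrupted = bytearray(frame)
corrupted[0] ^= 0x01  # flip a single data bit
print(verify(bytes(corrupted)))
```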






d. Reasonableness Checks 

Reasonableness checks determine if the state of some object in the system 
is “reasonable.” Reasonableness checks can be implemented either by checking the range 
or the rate of change. The range checks attempt to determine if a certain value is within a 
specified range. The rate of change checks attempt to determine if the rate of change of a 
certain value is within specified bounds. 
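
Both forms of reasonableness check can be sketched on a stream of sensor readings. The temperature values, the allowed range, and the maximum rate of change below are hypothetical.

```python
def range_ok(value, low, high):
    """Range check: the value must lie within [low, high]."""
    return low <= value <= high


def rate_ok(previous, current, interval, max_rate):
    """Rate-of-change check: the change per unit time must not exceed max_rate."""
    return abs(current - previous) / interval <= max_rate


# Hypothetical temperature readings taken one second apart.
readings = [20.0, 20.5, 21.0, 35.0, 21.5]

for previous, current in zip(readings, readings[1:]):
    reasonable = range_ok(current, -40.0, 85.0) and rate_ok(previous, current, 1.0, 2.0)
    print(current, "reasonable" if reasonable else "error declared")
```

The reading of 35.0 passes the range check but fails the rate-of-change check, which shows why the two checks complement each other.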

e. Diagnostic Checks

Diagnostic checks are implemented by applying special input values to a
system, for which the correct output values are known. These types of detection mechanisms
are usually built into the system and are used for the system’s initial self-checking process.

2. Damage Confinement and Assessment 

The main goal of the damage confinement and assessment phase is to determine 
the boundaries of corruption before starting the error recovery process. During the time 
delay between the failure and the event of error detection, an error may propagate and 
spread to other parts of the system. 

The main reason for error propagation is that communication takes place
among the system components. For that reason, the information flow between the
components of the system has to be examined after the error is detected in order to assess 
the extent of the damage. The goal is to identify a state in which no information exchange 
has occurred. Then the damage could be limited to this boundary. 






3. Error Correction



After the error is detected and the damage is assessed, the erroneous state of the 
system should be corrected. This correction can be made using a process called effective 
error processing. Effective error processing refers to the correction made after an error 
has taken effect. The goal of effective error processing is to bring the effective error back
to a latent state, before the occurrence of a failure if possible. Effective error processing
may take two forms: error recovery and error masking. 

a. Error Recovery 

The error recovery mechanism typically denies the service request and sets 
the system to an error-free state so that it can service subsequent requests. Error recovery 
can be classified as backward error recovery and forward error recovery. 

In backward error recovery, when the error is detected, the system is
restored to a previously occupied (correct) state prior to the error becoming effective. In
this method, states of the system are periodically checkpointed on some stable storage
that would not be affected by a failure. When the error is detected, the system is rolled
back to the last checkpointed state, which is assumed to be error-free. This technique is
well suited to transient faults, because restarting the system from the last checkpointed
state will not introduce the error again. Since checkpointing has to be performed
periodically on stable storage, the backward error recovery technique introduces a great
amount of overhead to the system. However, due to its simplicity, the backward error recovery
mechanism is the most commonly used error recovery technique.
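
Checkpointing and rollback can be sketched as below. The class name and the account-balance state are hypothetical, and an in-memory deep copy stands in for the stable storage that a real implementation would need.

```python
import copy


class CheckpointedSystem:
    """Keeps one saved copy of the state that can be restored after an error."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)  # stands in for stable storage

    def checkpoint(self):
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._saved)


system = CheckpointedSystem({"balance": 100})
system.state["balance"] += 50
system.checkpoint()                 # this state is assumed to be error-free
system.state["balance"] = -999      # hypothetical effect of a transient fault
if system.state["balance"] < 0:     # error detection by a simple range check
    system.rollback()
print(system.state)
```

Any work done between the last checkpoint and the detection of the error is lost on rollback, which is part of the overhead mentioned above.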






In forward error recovery, instead of rolling back, the system is set to a 
new error-free state (one not previously occupied) by taking the necessary corrective 
actions. In order to decide on the necessary actions, the exact nature of the error has to be 
known, and the exact damage has to be determined. These qualifying characteristics 
make the forward error recovery technique very difficult to implement. 

b. Error Masking 

In error masking, the erroneous state of the system contains enough 
redundancy to enable the delivery of an error-free service from the erroneous internal 
state. Examples of error masking are the error-correcting codes used for electronic, 
magnetic, and optical storage. Additionally, the N-modular redundancy (NMR) technique, discussed previously in the
passive hardware redundancy section, can be given as an example of error masking. 

4. Fault Treatment and Continued Service 

Unlike the first three phases, this phase does not deal with errors. Faults are the
main focus of the fault treatment and continued service phase. If the fault is transient, then when
the system is restarted from the error-free state, the same problem will not occur again.
However, if the error is caused by a permanent fault, then restarting the system from the
error-free state will cause the same error again. Thus, the identification of the faulty
component and its exclusion from the computation performed after recovery is essential.
This phase can be divided into two sub-phases, known as the fault
location phase and the system repair phase.






In the fault location phase, the component of the system containing the fault is 
identified. In the system repair phase, the located faulty component is repaired. This 
repair can be done on-line and without manual intervention. 

When the system repair phase is completed, the system can continue to provide its 
services again. The overall effect of fault tolerance phases on the system would take the 
form of a minor discontinuity in service or some performance degradation. 






III. FAULT TOLERANCE IN WINDOWS NT OPERATING SYSTEM 

The SAAM server runs as an application and builds a Path Information Base
(PIB) in volatile memory to support QoS routing. Specifically, the server identifies
those paths or subpaths that can potentially be used to route flows, and maintains
up-to-date performance parameters for each of them. The server computes path performance
parameters by aggregating link-level performance data passed up from each router.

The SAAM server is currently prototyped as a Java application in the Windows
NT Server 4.0 operating system environment. The factors that influenced the choice of
Windows NT Server included reliability, scalability, stability, speed, and ease of
administration.

Since the SAAM project is currently prototyped in the Windows NT operating
system environment, it is essential to be aware of the fault tolerance related
features that this operating system already provides, so as not to reinvent the wheel.
Therefore, this chapter will
focus on the Windows NT server operating system features that support fault tolerance. 

Windows NT Server 4.0 includes the following fault tolerance related
features: 

• Error handling and protected subsystems 

• NT File System 

• Automatic restart 

• Tape backup support 

• Uninterruptible power supply (UPS) support 

• Fault-tolerant storage



In addition to these features, the Windows NT Server Enterprise Edition offers the 
following services: 

• Microsoft Cluster Server (MSCS) 

• Windows NT Load Balancing Service (WLBS) 

A. ERROR HANDLING AND PROTECTED SUBSYSTEMS 

Software applications do not always operate as expected; they can enter into 
faulty states. Windows NT Server is designed to tolerate these faults by ensuring that 
they do not affect other components of the operating system. For the Windows NT 
Server, the first line of defense against software errors is its structured method of 
exception handling. When an abnormal event occurs, the event is captured and either the 
processor or the operating system issues an exception. This design ensures that no
undetected error is allowed to influence the system or other user programs. [Ref. 5] 

Windows NT Server also employs protected subsystems in its design. Protected 
subsystems are separate, unique memory locations that are assigned to different processes 
and applications. By isolating programs in this way, Windows NT Server ensures that a 
program fault will not affect the system’s kernel and, as a result, crash the operating 
system. Similarly, programs are isolated from each other so that when a program faults, it 
does not adversely affect other programs running on the system. This architecture makes 
it safe to deploy new Windows NT Server-based applications. New applications can be 
run and tested on a Windows NT Server-based machine without concern that they will 
adversely affect the system or other production applications. As a result, deploying
powerful, new server-based applications on Windows NT Server is less risky than it is
with some other server operating systems. [Ref. 5]

B. NT FILE SYSTEM (NTFS) 

NTFS is a recoverable file system that uses caching and allows recovery from a 
disk failure. NTFS helps guarantee the consistency of the disk volume by using standard 
transaction logging and recovery techniques, although it does not guarantee the protection 
of user data. All data is accessed via the file cache. While the user searches folders or 
reads files, data to be written to disk accumulates in the file cache. If the same data is 
modified several times, all those modifications are captured in the file cache. The result is 
that the file system needs to write to the disk only once to update the data. [Ref. 6]

When a disk error occurs during a write operation, NTFS can automatically
remap the bad sector and allocate a new cluster for the data. The
following paragraphs discuss how Windows NT automatically recovers from some kinds of
disk errors. Windows NT provides two kinds of disk error recovery: dynamic data
recovery by using sector sparing, and NTFS cluster remapping.

Dynamic data recovery by using sector sparing is only available on SCSI disks
that are configured as part of a fault-tolerant volume. Sector sparing works on
fault-tolerant volumes because a copy of the data on the sector with the error can be
regenerated. Windows NT Server obtains a spare sector from the disk device driver to
replace the bad sector. It then recovers the data by reading the sector from the mirror disk
or recalculating the data from a stripe set with parity, and writes the data to the new
sector. This processing is transparent to any applications performing disk input/output
(I/O) operations. Sector sparing eliminates error messages such as “Abort, Retry, or
Fail?” that occur when a file system encounters a bad sector. [Ref. 7]

NTFS cluster remapping is another disk recovery technique. When Windows NT 
returns a bad sector error to the NTFS file system, NTFS dynamically replaces the cluster 
containing the bad sector and allocates a new cluster for the data. If the error occurred 
during a read on a non-fault-tolerant volume, NTFS returns a read error to the calling 
program, and the data are lost. When the error occurs during a write, NTFS writes the 
data to the new cluster, and no data are lost. NTFS puts the address of the cluster 
containing the bad sector in its Bad Cluster File so the bad sector will not be reused. 
[Ref. 7] 

Windows NT Server provides additional recovery mechanisms for fault-tolerant 
volumes (mirror sets and stripe sets with parity). Table 3.1 summarizes what happens if a 
sector goes bad on a computer running Windows NT Server. 

C. AUTOMATIC RESTART 

The Windows NT operating system includes an automatic restart feature. In the
event of a failure, the system can be configured to automatically restart itself. This feature
of Windows NT Server helps maximize system uptime. To assist the administrator in
determining the cause of the failure, Windows NT Server can be set to write its
memory contents to a disk file before restarting, for later analysis. [Ref. 5]






Description         | Fault-Tolerant Disk      | Fault-Tolerant Disk      | Fault-Tolerant Disk
                    | (FtDisk) installed with  | (FtDisk) installed with  | (FtDisk) not installed
                    | a SCSI disk that has     | a non-SCSI disk or disk  | with any kind of disk
                    | spare sectors            | with no spare sectors    |
--------------------+--------------------------+--------------------------+-------------------------
Fault-tolerant      | 1. FtDisk recovers the   | 1. FtDisk recovers the   | N/A
volume (Windows     |    data.                 |    data.                 |
NT Server only)     | 2. FtDisk replaces the   | 2. FtDisk sends the data |
                    |    bad sector.           |    and bad-sector error  |
                    | 3. File system does not  |    to the file system.   |
                    |    know about the error. | 3. NTFS performs cluster |
                    |                          |    remapping.            |
                    |                          | 4. FAT doesn’t do        |
                    |                          |    anything about the    |
                    |                          |    error.                |
--------------------+--------------------------+--------------------------+-------------------------
Non-fault-tolerant  | 1. FtDisk cannot recover | 1. FtDisk cannot recover | 1. Disk driver returns a
volume              |    the data.             |    the data.             |    bad-sector error to
                    | 2. FtDisk sends a        | 2. FtDisk sends a        |    the file system.
                    |    bad-sector error to   |    bad-sector error to   | 2. NTFS performs cluster
                    |    the file system.      |    the file system.      |    remapping. On a read
                    | 3. NTFS performs cluster | 3. NTFS performs cluster |    operation, data are
                    |    remapping. On a read  |    remapping. On a read  |    lost.
                    |    operation, data are   |    operation, data are   | 3. FAT loses the data on
                    |    lost.                 |    lost.                 |    both read and write.
                    | 4. FAT loses the data on | 4. FAT loses the data on |
                    |    both read and write.  |    both read and write.  |

Table 3.1. What Happens If a Sector Goes Bad? [From Ref. 8]






D. TAPE BACKUP SUPPORT 



Regular tape backup is an important part of guaranteeing data availability. 
Windows NT Server includes a graphical tool called Backup that makes it easy to back up
Windows NT Server-based data to tape. Windows NT Backup provides five backup 
types: normal, copy, incremental, differential, and daily copy. Some backup types use 
backup markers, also known as archive attributes, to track when a file has been backed 
up. Table 3.2 describes the backup types. 



Backup Type     Actions

Normal (Full)   Backs up selected files and marks each as having been backed up.
                With normal backups, one can restore files quickly because files
                on the last tape are the most current.

Incremental     Backs up only those files created or changed since the last normal
                or incremental backup. It marks files as having been backed up.

Differential    Backs up those files created or changed since the last normal (or
                incremental) backup. It does not mark files as having been backed up.

Copy            Backs up selected files, but does not mark each file as having been
                backed up. Copying is useful if the user wants to back up files
                between normal and incremental backups, because copying does
                not invalidate these other backup operations.

Daily copy      Backs up selected files that have been modified on the day the daily
                backup is performed. The backed-up files are not marked as having
                been backed up.



Table 3.2. Backup Types and Their Functions. [From Ref. 9] 
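The role of the backup marker in Table 3.2 can be illustrated with a short simulation. This is a sketch under stated assumptions, not the logic of the Windows NT Backup tool: FileEntry and run_backup are hypothetical names, and the archive flag is modeled as a simple boolean that is set whenever a file changes.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class FileEntry:
    name: str
    archive: bool   # backup marker: True means "changed since last marking backup"
    modified: date  # date of the last modification


def run_backup(files: List[FileEntry], kind: str, today: date) -> List[str]:
    """Return the names that a backup of the given type would copy to tape."""
    if kind in ("normal", "copy"):
        selected = list(files)                       # every selected file
    elif kind in ("incremental", "differential"):
        selected = [f for f in files if f.archive]   # only files with the marker set
    elif kind == "daily":
        selected = [f for f in files if f.modified == today]
    else:
        raise ValueError(f"unknown backup type: {kind}")
    if kind in ("normal", "incremental"):
        for f in selected:
            f.archive = False                        # mark as backed up
    return [f.name for f in selected]
```

Running a differential backup twice in a row selects the same files both times, because the marker is left untouched; an incremental backup clears the marker, so an immediate second incremental backup selects nothing.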






E. UNINTERRUPTIBLE POWER SUPPLY (UPS) 



The Uninterruptible Power Supply (UPS) service is a system software component
of Windows NT that can be configured to detect and warn of impending power failure.
The UPS device itself has built-in electronics that constantly monitor line voltages. If the
line voltage fluctuates above or below preset limits, or fails entirely, the UPS supplies
power to the computer system from built-in batteries. UPS systems provide a hardware
interface that can be connected to the computer. Using appropriate software, this interface
enables an orderly handling of the power failure, including performing a system shutdown
before the UPS batteries are depleted.

A UPS offers significant benefits, given that power loss accounts
for almost 27% of all unplanned outages. In some locations, and at certain times of the
year, power outages can occur as often as once a day. Operators should use redundant 
power supplies for maximum reliability. 

Windows NT has built-in UPS functionality that takes advantage of the special 
features that many UPS systems provide. These features ensure the integrity of data on 
the system and allow the computer system and UPS to be shut down in a controlled
manner if a power failure outlasts UPS batteries. In addition, users connected to a 
computer running Windows NT Server can be notified that a shutdown will occur and 
new users are prevented from connecting to the computer. Finally, damage to the 
hardware from a sudden, uncontrolled shutdown can be prevented. [Ref. 6] 






F. FAULT-TOLERANT STORAGE



The term Redundant Array of Inexpensive Disks (RAID) was first coined by
Chen, Gibson, Katz, and Patterson of the University of California at Berkeley. [Ref. 16] 
The RAID Advisory Board (RAB) has since re-named the term, replacing “inexpensive” 
with “independent.” 

RAID minimizes loss of data caused by problems with accessing data on a hard 
disk. RAID is a fault-tolerant disk configuration in which part of the physical storage 
capacity contains redundant information about data stored on the disks. The redundant 
information enables regeneration of the data if one of the disks or the access path to it 
fails, or a sector on the disk cannot be read. [Ref. 7] 

Some vendors sell disk subsystems that implement RAID technology completely 
within the hardware. Some of these hardware implementations support hot swapping of 
disks, which enables the user to replace a failed disk while the computer is still running 
Windows NT Server. [Ref. 7] Regardless of their implementation techniques, all RAID 
disk configurations perform the following functions: 

• Regeneration of data to satisfy a read request when a disk or a path to a disk 
has failed. 

• Reconstruction of the missing data onto the new disk when the user has 
replaced the failed disk (or the path to it). 

Normally a RAID set appears as a single large disk drive to applications and the 
operating system, although it is actually an array of drives with equal capacity. RAID 
terminology is standardized by level, as indicated in Table 3.3. 






RAID Level   Functionality

0            Data are striped across available disk drives, to improve access
             times and throughput. There is no redundancy.

1            Two disk drives are mirrored (both store the same data), using a
             single disk controller. Data can be read off both drives
             simultaneously (either drive can service any request), providing
             improved performance for reads (but not for writes), and
             redundancy.

2            Data are spanned (striped, bit-by-bit) across multiple disks, and
             additional disks are used to store Hamming codes (to detect and
             correct errors or recover from failed drives). Four data disks
             would require three additional error detection and correction
             disks.

3            Data are striped (sometimes called interleaved) either bit-by-bit
             or (more commonly) byte-by-byte across two or more (four
             is apparently best) data disks (for example, first byte to first
             disk, second byte to next disk, and so on, written in parallel to
             all disks). A parity byte is constructed from the corresponding
             bytes on the data disks and is written to one additional disk,
             which is dedicated as a parity disk. The contents of a failed disk
             can be reconstructed from the other disks. However, the use of a
             single parity disk creates a performance bottleneck.

4            Same as RAID 3, but data are striped (and parity is
             constructed) in disk sectors (which are the smallest unit of disk
             storage allocation) rather than bits or bytes.

5            Data are striped sector by sector across two or more disks.
             Parity information sectors are striped along with the data on
             each disk, and there is no dedicated parity disk. Since both
             parity and data are striped, simultaneous writes are possible
             (depending on where the data have to go).



Table 3.3. RAID Levels. 
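Two of the figures in Table 3.3 can be checked with simple arithmetic. The sketch below assumes a single-error-correcting Hamming code for RAID level 2, for which r check disks suffice for m data disks when 2^r >= m + r + 1, and it assumes that levels 3 through 5 give up exactly one disk's worth of capacity to parity. The function names are hypothetical.

```python
def hamming_check_disks(data_disks: int) -> int:
    """Smallest r such that 2**r >= data_disks + r + 1 (assumed SEC Hamming code)."""
    r = 0
    while 2 ** r < data_disks + r + 1:
        r += 1
    return r


def usable_fraction(level: int, total_disks: int) -> float:
    """Fraction of raw capacity that holds user data (levels 0, 1, 3, 4, 5 only)."""
    if level == 0:
        return 1.0                               # striping only, no redundancy
    if level == 1:
        return 0.5                               # every block is stored twice
    if level in (3, 4, 5):
        return (total_disks - 1) / total_disks   # one disk's worth of parity
    raise ValueError("level not covered by this sketch")
```

With four data disks, hamming_check_disks(4) yields three check disks, in agreement with the RAID level 2 entry in the table.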



Windows NT Server provides a software implementation of disk striping at RAID 
level 0 and disk mirroring at RAID level 1. It also provides an implementation of RAID 
level 5. Cluster server services in the Windows NT Server Enterprise Edition use RAID
subsystems exclusively. 







1. Stripe Set 



Stripe sets are composed of stripes of equal size on each disk in the volume. One 
can create a stripe set from equal sized, unallocated areas on two to 32 physical disks. For 
Windows NT Workstation and Windows NT Server, the size of the stripe is 64K. 

Stripe sets do not contain any redundant information. Therefore, the cost per MB 
for a stripe set is identical to that of the same amount of storage configured from a 
contiguous area on a single disk. Although the data are spread across multiple disks, there 
is no fault tolerance. When any disk fails, the whole stripe set fails, and no data can be 
recovered. The reliability for the stripe set is no better than the least reliable disk in the 
set. [Ref. 7] 

A stripe set may be used for performance reasons. Access to the data on a stripe 
set is usually faster than access to the same data on a single disk, because the I/O is 
spread across more than one disk. Therefore, Windows NT can perform seeks on
more than one disk at the same time, and can even have simultaneous reads or writes 
occurring. [Ref. 7] 
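To make the layout concrete, the following sketch maps a byte offset within a stripe set to a disk. It assumes a simple round-robin placement of 64K stripe units across the disks; the source gives the stripe size but does not spell out the placement rule, and the names STRIPE_UNIT and locate are hypothetical.

```python
STRIPE_UNIT = 64 * 1024  # stripe size given in the text (64K)


def locate(offset: int, disks: int):
    """Map a volume byte offset to (disk index, stripe row, offset within the unit).

    Assumes round-robin placement: unit 0 on disk 0, unit 1 on disk 1, and so on.
    """
    unit = offset // STRIPE_UNIT          # which 64K unit the byte falls in
    return unit % disks, unit // disks, offset % STRIPE_UNIT
```

Under this assumption, consecutive 64K units land on different disks, which is why a large sequential transfer can keep several disks busy at once.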

2. Mirror Set 

A mirror set provides an identical twin for the selected partition. All data written to
the mirror set are written to both partitions, which results in disk space utilization of
only 50 percent. Creating a mirror set is similar to making a copy of a document by using a
copy machine. The original partition is like the original of the document, and the shadow
partition is the copy. Unlike a copy machine, however, Windows NT continually updates both
the original and shadow partitions when any changes are made to the mirror set. It is not
necessary to use identical physical disks or to have the same partitions on each disk,
although identical disks should be used if putting the system partition on a mirror set. A
mirror set requires only sufficient unused space on the second disk to create the shadow
partition. [Ref. 7]

If there is a read failure on one of the disks, the fault-tolerant disk driver reads the 
data from the other disk in the mirror set. If there is a write failure on one of the disks in 
the mirror set, the fault-tolerant disk driver uses the remaining disk for all accesses. 
Because dual-write operations can degrade system performance, many mirror set 
implementations use duplexing, where each disk in the mirror set has its own disk 
controller. [Ref. 7] 

3. Stripe Set With Parity 

The parity strip is the exclusive-OR (XOR) of all the data values for the data 
strips in the stripe. If no disks in the stripe set with parity have failed, the new parity for a 
write can be calculated without having to read the corresponding strips from the other 
data disks. Thus, only two disks are involved in a write operation, the target data disk and 
the disk that contains the parity strip. Figure 3.1 shows the steps that are involved in 
writing data to a stripe set with parity. [Ref. 7] 
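The two-disk write described above follows from the algebra of XOR: the new parity equals the old parity XOR the old data XOR the new data. The sketch below checks this identity on small byte strings; the helper names are hypothetical and the strips are far shorter than real ones.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length strips."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(strips):
    """Parity strip computed from scratch as the XOR of all data strips."""
    parity = bytes(len(strips[0]))  # all-zero strip of the right length
    for strip in strips:
        parity = xor_bytes(parity, strip)
    return parity


def updated_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """New parity after one strip changes, without reading the other data strips."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)
```

Only the old data strip and the old parity strip have to be read, which is why the write touches just the target data disk and the parity disk.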

When implementing a stripe set with parity, there must be at least three disks and 
no more than 32 disks in the set. The physical disks do not need to be identical. However, 
there must be equal size blocks of unpartitioned space available on each physical disk in 
the set. The disks can be on the same or different controllers. As with stripe sets, disks
cannot be added to a stripe set with parity later to increase the size of the volume.
[Ref. 7]




Figure 3.1. Steps in Writing Data to a Stripe Set with Parity. (Diagram not reproduced;
its axis is labeled Time.)

If one of the disks in a stripe set with parity fails, none of the data are lost. When 
a read operation requires data from the failed disk, the system reads all of the remaining 
good data strips in the stripe and the parity strip. Each data strip is subtracted (with XOR) 
from the parity strip. The result is the missing data strip. [Ref. 7] 
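The regeneration step can be sketched in a few lines: XOR-ing the parity strip with every surviving data strip leaves the missing strip. The helper names below are hypothetical, and the strips are toy-sized.

```python
from functools import reduce


def xor_strips(strips):
    """Byte-wise XOR across a list of equal-length strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


def reconstruct(surviving_data, parity: bytes) -> bytes:
    """Regenerate the strip of the failed disk from the survivors and the parity."""
    return xor_strips(list(surviving_data) + [parity])
```

Because XOR is its own inverse, "subtracting" each good strip from the parity is the same operation that built the parity in the first place.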

When the system needs to write a data strip to a disk that has failed, the system 
reads the other data strips and the parity strip and then backs them out of the parity strip, 
leaving the missing data strip. The modifications needed to the parity strip can now be 
calculated and made. Only the parity strip is written upon; the data strip is not written 
upon because it is bad. 

There is no effect on a read operation when the failed disk contains a parity strip. 
(The parity strip is not needed for a read unless there is a failure in a data strip.) When the 
failed disk contains a parity strip, the system does not compute or write the parity strip 
when there is a change in a data strip. [Ref. 7] 






G. MICROSOFT CLUSTER SERVER (MSCS) 

1. Overview of Server Cluster 

Clusters of computer systems have been built and used for over a decade. Pfister 
defines a cluster as "a parallel or distributed system that consists of a collection of 
interconnected whole computers, that is utilized as a single, unified computing resource." 
[Ref. 10] The goal of a cluster is to make it possible to share a computing load over 
several systems on a network without either the users or system administrators needing to 
know that more than one system is involved. 

In a cluster, if a certain resource or set of resources goes down, the system 
intelligently chooses where and how to run applications in the network. With clustering, 
one of two nodes can also be used to run certain services while the other is used for 
maintenance. Later, the maintained node can be returned to the cluster without affecting 
services. In short, clustering provides the high availability of a multiple-node network 
with the management simplicity of a single address space. [Ref. 12] 

There are basically three techniques that clusters use to make disk data available 
to more than one server: 

• Shared disks: In the shared disk model, software running on any system in 
the cluster may access any resource (e.g., a disk) connected to any system in 
the cluster. If two systems need to see the same data, the data must either be 
read twice from the disk or copied from one system to another. [Ref. 11]

• Mirrored disks: A more flexible alternative is to let each server have its own 
disks, and to run software that "mirrors" every write from one server to a copy 
of the data on at least one other server. This technique can be used for keeping 
the data at a disaster recovery site in synch with that of a primary server. 

• Shared nothing: In the shared nothing software model, each system within 
the cluster owns a subset of the cluster’s resources. Only one system may own 
and access a particular resource at a time. However, upon failure, another 
dynamically determined system may take ownership of the resource of the 
failed system. In addition, requests from clients are automatically routed to the 
system that owns the resource. For example, if a client request requires access 
to resources owned by multiple systems, one system is chosen to host the 
request. The host system analyzes the client request and ships sub-requests to
the appropriate systems. Each system executes the sub-request and returns 
only the required response to the host system. The host system assembles a 
final response and sends it to the client. [Ref. 14] 
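The ownership rule of the shared nothing model can be sketched as a small table that maps each resource to exactly one owning system. The class below is a hypothetical illustration; in particular, the policy for choosing the new owner (the survivor that currently owns the fewest resources) is an assumption, since the source says only that the new owner is dynamically determined.

```python
class SharedNothingCluster:
    """Toy model: every resource has exactly one owning system at a time."""

    def __init__(self, ownership):
        self.ownership = dict(ownership)   # resource name -> owning system

    def route(self, resource: str) -> str:
        """Requests for a resource always go to its current owner."""
        return self.ownership[resource]

    def fail(self, failed: str, survivors) -> None:
        """Move each resource of the failed system to the least-loaded survivor."""
        for resource, owner in sorted(self.ownership.items()):
            if owner != failed:
                continue
            load = {s: 0 for s in survivors}
            for current in self.ownership.values():
                if current in load:
                    load[current] += 1
            # Assumed policy: fewest owned resources wins, name breaks ties.
            self.ownership[resource] = min(survivors, key=lambda s: (load[s], s))
```

After a failure, route() returns a surviving system for every resource, and no resource is ever owned by two systems at once.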

2. MSCS 

MSCS, also known as “Wolfpack”, is a built-in feature of the Windows NT 
Server, Enterprise Edition. It is software that supports the connection of two servers into 
a "cluster" for higher availability and easier manageability of data and applications. 
MSCS can automatically detect and recover from server or application failures. It can be 
used to move server workload to balance utilization and to provide for planned 
maintenance without downtime. [Ref. 13] 






MSCS uses software "heartbeats" to detect failed applications or servers. In the
event of a server failure, it employs a "shared nothing" clustering architecture that 
automatically transfers ownership of resources (such as disk drives and IP addresses) 
from a failed server to a surviving one. It then restarts the failed server’s workload on the 
surviving server. If an individual application fails (but the server does not), MSCS will 
typically try to restart the application on the same server. If that fails, MSCS moves
the application’s resources and restarts the same application on the other server. The 
cluster administrator can use a graphical console to set various recovery policies, such as 
dependencies between applications, whether or not to restart an application on the same 
server, and whether or not to automatically "failback" (rebalance) workloads when a
failed server comes back online. Generic MSCS architecture is shown in Figure 3.2. 
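The application recovery sequence described above (restart in place, then move to the other server) can be sketched as follows. This is not the MSCS implementation or its API: recover and the try_start callback are hypothetical names, and the restart_on_same_server flag stands in for the administrator-defined recovery policy.

```python
from typing import Callable, Optional, Sequence


def recover(app: str, current: str, nodes: Sequence[str],
            try_start: Callable[[str, str], bool],
            restart_on_same_server: bool = True) -> Optional[str]:
    """Return the node that ends up hosting the application, or None.

    try_start(app, node) is a caller-supplied function that attempts to start
    the application on a node and reports whether it succeeded.
    """
    # First choice (if the policy allows it): restart where the application was.
    if restart_on_same_server and try_start(app, current):
        return current
    # Otherwise move the application to another node in the cluster.
    for node in nodes:
        if node != current and try_start(app, node):
            return node
    return None  # recovery failed everywhere
```

A caller might pass a try_start function that fails on the original server, in which case the application is restarted on the surviving one.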



Figure 3.2. A Generic MSCS Setup. [After Ref. 12] (Diagram not reproduced; it includes
a group labeled Client PCs.)







Figure 3.3 shows an overview of the components and their relationships in a 
single system of a Windows NT cluster. Microsoft Cluster Server is composed mainly of
three key components: 

• The Cluster Service 

• The Resource Monitor 

• Resource and Cluster Administrator Extension DLLs 

The Cluster Service (which is composed of the Event Processor, the Failover 
Manager/Resource Manager, the Global Update Manager, the Communication Manager, 
the Checkpoint Manager, and Membership Manager) is the core component of MSCS and 
runs as a high-priority system service. The Cluster Service controls cluster activities and 
performs such tasks as coordinating event notification, facilitating communication 
between cluster components, handling failover operations, and managing the 
configuration. Each cluster node runs its own Cluster Service. [Ref. 32] 

The Resource Monitor is an interface between the Cluster Service and the cluster 
resources, and runs as an independent process. The Cluster Service uses the Resource 
Monitor to communicate with the resource DLLs. The DLL handles all communication 
with the resource, thus shielding the Cluster Service from resources that misbehave or 
stop functioning. Multiple copies of the Resource Monitor can be running on a single 
node, thereby providing a means by which unpredictable resources can be isolated from 
other resources. [Ref. 32] 

The Resource Monitor and resource DLL communicate using the Resource API, 
which is a collection of entry points, callback functions, and related structures and 
macros used to manage resources. Applications that implement their own resource DLLs
to communicate with the Cluster Service and that use the Cluster API to request and
update cluster information are defined as cluster-aware applications. Applications and 
services that do not use the Cluster or Resource APIs and cluster control code functions 
are unaware of clustering and have no knowledge that MSCS is running. These cluster- 
unaware applications are generally managed as generic applications or services. [Ref. 32] 




Figure 3.3. MSCS Components on a Single Windows NT Server. [From Ref. 32]



MSCS can reduce planned and unplanned downtime. However, even with MSCS, 
a server could still experience downtime from the following events: 






• MSCS failover time: If MSCS recovers from a server or application failure, or 
if it is used to move applications from one server to another, the application(s) 
will be unavailable for a non-zero period of time. 

• Failures from which MSCS cannot recover: There are types of failure that 
MSCS does not protect against, such as loss of a disk not protected by RAID, 
loss of power when a UPS is not used, or loss of a site when there’s no fast- 
recovery disaster recovery plan. Most of these can be survived with minimal 
downtime, however, if precautions are taken in advance. 

• Server maintenance that requires downtime: MSCS can keep applications and 
data online through many types of server maintenance, but not in all 
circumstances. Two such circumstances are completely upgrading both servers in a cluster
and installing a new version of an application whose new on-disk data format requires
reformatting preexisting data.

MSCS does not require any special software on client computers, so the user’s 
experience during failover depends on the nature of the client side of their client-server 
application. Client reconnection is often transparent, because MSCS has restarted the 
applications, file shares, etc., at exactly the same IP address. [Ref. 13] 

If a client is using "stateless" connections, such as a standard browser connection,
then the client would be unaware of a failover if it occurred between server requests. For
client-side applications that have "stateful" connections to the server, a new logon is
typically required following a server failure. In many cases, this approach is required for
security purposes. [Ref. 13]

The servers in an MSCS cluster cannot be located at separate locations for 
recovery from site disasters. All of the cluster configurations currently being considered 
for validation use SCSI connections to storage resources, which limits the distance 
between clustered servers to the distance supported by standard SCSI. This is typically no 
more than 25 meters. [Ref. 13] There are three types of server applications that will 
benefit from MSCS clusters: 

1. "In the box" services of Windows NT Server, Enterprise Edition: file shares, print
queues, Internet/intranet sites managed by Microsoft Internet Information
Server (Windows NT Server's built-in Web server), and Microsoft Message Queue
Server (MSMQ) and Microsoft Transaction Server (MTS) services, both
of which are part of Windows NT Server.

2. Generic applications: MSCS includes a point-and-click wizard for setting up any 
well-behaved server application for basic error detection, automatic recovery, and 
operator-initiated management. A "well behaved" server application is one that 
keeps a recoverable state on shared SCSI disk(s), and whose client can gracefully 
handle a pause in service of up to a minute as the application is automatically 
restarted by MSCS. 

3. Cluster-aware applications: Software vendors will test and support their 
application products on MSCS. Over time, vendors will provide MSCS-based 
enhancements, from simpler setup and faster failover to cluster-enabled scalability 
and load balancing. 






H. WINDOWS NT LOAD BALANCING SERVICE (WLBS) 



Microsoft Windows NT Load Balancing Service (WLBS), a feature of Windows 
NT Server 4.0, Enterprise Edition, provides scalability and high availability to
enterprise-wide Transmission Control Protocol/Internet Protocol (TCP/IP) services, such as
Web, proxy, Virtual Private Networking (VPN), and streaming media services. WLBS is based
on the Convoy Cluster Software by Valence Research, Inc., a recent Microsoft 
acquisition. [Ref. 15] 

The two principal goals of Microsoft Windows NT Load Balancing Service 
(WLBS) are to provide high availability for Internet server programs and to scale
server performance. It accomplishes these goals by using a cluster of two or more
computers (called hosts) working together, as shown in Figure 3.4. WLBS installs as a 
standard Windows NT networking driver and runs on an organization’s existing LAN. 
All servers within a cluster are placed on a single subnet. Internet clients access the 
cluster using a single IP address (or a set of addresses for a multi-homed host). The 
clients cannot distinguish the cluster from a single server. [Ref. 17] 




Figure 3.4. An Example Configuration of WLBS. [From Ref. 15] (Diagram not reproduced;
it shows two hosts, each labeled Cluster Node.)







1. WLBS Features 

WLBS cluster servers emit a “heartbeat” to other nodes in the cluster, and listen
for the heartbeat of other nodes. When a node fails or goes offline, WLBS automatically
reconfigures the cluster to direct client requests to the remaining computers. In addition, 
for load-balanced ports, the load is automatically redistributed among the computers still 
operating, and ports with a single server have their traffic redirected to a specific host. 
While connections to the failed or offline server are lost, the offline computer can 
transparently rejoin the cluster and regain its share of the workload once the necessary 
maintenance is completed. In addition, WLBS handles inadvertent subnetting and 
rejoining of the cluster network. [Ref. 15] 

WLBS load-balances incoming TCP/IP traffic across all the hosts in a cluster to 
scale performance. Running a copy of the server program on each load-balanced host 
enables the load to be partitioned among them in any manner. WLBS transparently 
distributes the client requests among the hosts and lets the clients access the cluster using 
one or more “virtual” IP addresses. Up to 32 hosts may operate in each cluster, and hosts 
may be added transparently to a cluster to handle increased load. WLBS can also direct 
all traffic to a designated single host, called the default host. 

Load balancing can be specified for a single IP port or group of ports using port
management rules that tailor the workload for each service. In addition, optional support
for client sessions can be enabled, as well as optional port rules to let all client requests
be directed to a single host to further refine load balancing among different applications.
Undesired network access can also be blocked to certain IP ports. WLBS logs all actions
and cluster changes to the Windows NT event log. [Ref. 15]



Administrators can remotely control WLBS actions from any networked 
Windows NT-based computer using console commands or scripts. All cluster hosts can 
be controlled with one command, or controlled individually. The control program has 
fully encrypted password protection to prevent unauthorized access. [Ref. 15] 

No special hardware is needed to interconnect cluster hosts; the cluster may 
exchange status messages over a single local area network using Ethernet (10, 100, or 
gigabit) or FDDI adapter cards. WLBS also lets clients access the cluster with a single 
Internet logical name and IP address, while retaining individual names for each computer. 
In addition, server applications need not be modified to run in a WLBS cluster, and all 
operations, including recovery, require no human intervention. Computers can also be 
taken offline for preventive maintenance without disturbing cluster operations. [Ref. 15] 

WLBS supports stateful client sessions and Secure Sockets Layer (SSL). If a
server application (such as a Web server) maintains state information about a client 
session that spans multiple TCP connections, it is important that all TCP connections for 
this client be directed to the same cluster host. Should a server or network failure occur 
during a "stateful" client session, a new logon may be required to re-authenticate the 
client and re-establish session state. WLBS also allows modification of session support to 
direct all client requests from an IP Class C address range to a single cluster host. This 
feature ensures that clients who use multiple proxy servers to access the cluster will have 
their TCP connections directed to the same cluster host. [Ref. 15] 
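The effect of session support and Class C affinity can be illustrated with a simple hash-based mapping from client address to host. The source does not describe the actual WLBS distribution algorithm, so the CRC32 hash used here is only a stand-in, and pick_host is a hypothetical name.

```python
import ipaddress
import zlib


def pick_host(client_ip: str, hosts, class_c_affinity: bool = False) -> str:
    """Deterministically map a client address to one cluster host.

    The hash is an assumption made for illustration, not the WLBS algorithm.
    """
    key = int(ipaddress.IPv4Address(client_ip))
    if class_c_affinity:
        key &= 0xFFFFFF00  # keep only the Class C (first 24 bits) network part
    digest = zlib.crc32(key.to_bytes(4, "big"))
    return hosts[digest % len(hosts)]
```

Because the mapping depends only on the client address, every TCP connection from one client reaches the same host; with class_c_affinity enabled, all clients behind proxies in the same Class C range do as well. Removing a failed host from the list redistributes clients among the remaining hosts.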






2. WLBS Shortcomings



If one server fails, WLBS detects the failure within 10 seconds, but it does not
switch the failed server's active connections to other servers. [Ref. 28] Therefore,
active connections to a failed server are lost when the server goes offline; all other
connections are unaffected. Another drawback of WLBS is its lack of a data
replication mechanism. Because WLBS works at the network driver level and cannot
replicate data, it is suitable only for stateless services, such as a Web server farm.






IV. LOCAL AREA FAULT TOLERANCE FOR SAAM SERVER 



Fault tolerance for the SAAM server will be implemented in two phases: locally
and remotely. The first phase, local area fault tolerance for the SAAM server, focuses
mainly on tolerating component failures of one server, such as processor failure, disk
failure, and network interface card failure. In the second phase, remote area fault
tolerance (disaster recovery) for the SAAM server, backup servers are used to tolerate
environmental faults such as fire, earthquake, and flood that cause unrecoverable server
failures. The function of the second phase is to tolerate the failure of the local area fault
tolerance implementation of the SAAM server. Disaster recovery for the SAAM server
will be discussed in the next chapter.

Local area fault tolerance for the SAAM server will be implemented by one of the
existing third-party products. As the Windows NT operating system becomes more commonly
used in mission-critical applications, many commercial companies are focusing on
enhancing the fault tolerance features of this operating system. There are now
dozens of such products. In order to select a product that best meets the SAAM server
fault tolerance requirements, we have examined the five most promising products. The
findings are presented in this chapter.

A. PRODUCTS OVERVIEW 

According to recent studies on products providing fault tolerance for Windows 
NT, the following five products are the most promising: 

• ARCserve Replication 4.0 for Windows NT 




• CO-StandbyServer 4.2 for Windows NT 

• Double-Take 3.0 

• Endurance 4000 

• Octopus 3.2 

1. ARCserve Replication 4.0 for Windows NT 

ARCserve Replication is a software product developed by Computer Associates 
(www.cai.com) to provide server resilience for the Microsoft Windows NT operating 
system. ARCserve Replication allows servers to be loosely coupled using existing 
network connections and requires no special-purpose hardware. [Ref. 25] ARCserve 
Replication has the following three components: 

1. Server component. This component installs the ARCserve Replication Server 
service. This enables the computer to act as a primary or secondary server. 
The Server component must be installed on each computer that needs to be 
protected and on each server that will provide protection. 

2. Manager component. This installs ARCserve Replication Manager, which is the
user interface for ARCserve Replication. It enables the user to set up
server protection, monitor the progress of replication tasks, request a manual 
failover, and reinstate a failed server. These tasks can be performed for all 
ARCserve Replication servers from any computer running the Manager 
component. At least one Manager component must be installed. 

3. Alert component. This installs the Alert notification system, which warns the user
whenever replication events (such as failovers) occur.






In order to provide the secondary server with up-to-date files, a process called
synchronization must be performed. During the synchronization process, those
parts of the primary file system that constitute the workload are accurately mirrored on 
the secondary server. This initial synchronization process typically requires large 
amounts of data to be transferred, which can be time-consuming [Ref. 25]. However, 
ARCserve Replication lets the primary and secondary servers continue to operate while 
synchronization takes place. Files that are open and in use can also be replicated. 

When synchronization is complete, the secondary server contains an up-to-date set of
replicated files. When files are added to or removed from a backed-up directory on the
primary server, the changes are automatically replicated in nearly real time on the
secondary server. If a file is altered slightly, only the changes to the file are sent over the 
network to conserve bandwidth. 

While changes are transmitted automatically between servers, the information is 
only committed to disk when a transaction is completed. This prevents database 
corruption should failover occur in mid-transaction. The replication process takes place in
the background and does not require client systems to close files. All data remains live
and available during synchronization. Both FAT and NTFS are supported, and the file
systems on the two servers do not need to be the same. ARCserve Replication
runs on existing hardware, provided that there is sufficient disk space for the replicated 
data. 

To protect files on a server, a replication task must be set. The replication task 
defines the data to be protected, the secondary server to hold the replicated data, and the 
conditions that trigger a failover. The replication task also defines an alternate
identification for the primary server because the secondary server adopts the primary
server identity while it is standing in for the primary. During this set up process, the user 
can also specify the level of protection desired (data replication with failover or data 
protection only without failover) and the speed of the underlying network (very fast, fast, 
slow, or very slow). 

When the user sets up a replication task, certain conditions apply for clients if the 
primary and secondary servers are not in the same domain. Specifically, following a
failover, the clients can only access replicated data on the stand-in server if their accounts 
for the primary server also exist on the stand-in server. If the primary and the secondary 
servers are in the same domain, then user accounts automatically exist on both servers. 
[Ref. 24] 

During normal operation, ARCserve Replication continuously monitors the state 
of the primary server via TCP/IP or IPX/SPX heartbeat messages over the network,
looking for conditions that can cause it to initiate a failover. These conditions are as 
follows: 

• Permanent loss of contact with the primary server 

• Critically low disk space on the primary server 

• On command from the system administrator 

Permanent loss of contact with the primary server can have a variety of causes, 
including hardware and software crashes, power outage, and network malfunction. The 
secondary server monitors a regular “heartbeat” sent out by the primary. If a number of 
consecutive heartbeats are missed, ARCserve Replication assumes loss of contact (the 
precise period can be configured as it depends, among other things, on the network
speed). Optionally, it can then obtain more detail by using an independent serial 
connection between the primary and secondary servers, and by “pinging” other 
preselected network nodes. In this way, ARCserve Replication obtains sufficient 
information to determine whether loss of the heartbeat (as detected by the secondary) is 
due to genuine failure of the primary. This information, in conjunction with a policy set 
by the system administrator, helps ARCserve Replication to assess the severity of the 
failure, and whether or not to fail over. [Ref. 25] 
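
The detection logic described above can be sketched as a counter of consecutive missed heartbeats followed by a confirmation step. This is a minimal sketch, not the product's algorithm: the class, the function, and the particular decision policy are assumptions made for illustration, since the real policy is set by the system administrator.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    missed_limit: int   # consecutive misses that count as loss of contact
    missed: int = 0

    def record(self, heartbeat_received: bool) -> bool:
        """Update the miss counter; return True once contact is considered lost."""
        self.missed = 0 if heartbeat_received else self.missed + 1
        return self.missed >= self.missed_limit


def should_fail_over(contact_lost: bool, serial_link_alive: bool,
                     reachable_peers: int) -> bool:
    """Hypothetical policy combining the optional secondary checks."""
    if not contact_lost:
        return False
    if serial_link_alive:
        # The primary still answers on the serial link, so the network is at fault.
        return False
    # If no preselected node answers a ping, the secondary itself is probably isolated.
    return reachable_peers > 0


monitor = HeartbeatMonitor(missed_limit=3)
lost = [monitor.record(ok) for ok in (True, False, False, False)]
```

With a limit of three, contact is reported as lost only on the third consecutive miss; a single received heartbeat resets the counter.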

The system administrator can configure the drives to be monitored, and the level 
of free space that is to be regarded as critical. ARCserve Replication can also monitor 
free disk space levels on the secondary server, and raise an alert and/or suspend the 
replication task if critical levels are reached. [Ref. 25] 
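
A free-space check of this kind can be sketched with the Python standard library. The threshold value and function names below are hypothetical; the source does not state how ARCserve Replication measures or expresses the critical level.

```python
import shutil


def free_fraction(path: str) -> float:
    """Fraction of the volume holding `path` that is currently free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_critical(free: float, critical_level: float) -> bool:
    """True when free space has dropped below the administrator's threshold."""
    return free < critical_level


# Hypothetical threshold: treat less than 5% free space as critical.
CRITICAL_LEVEL = 0.05
current = free_fraction(".")
alert = is_critical(current, CRITICAL_LEVEL)
```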

The feature of failover on command from the system administrator is useful if 
planned maintenance or upgrade is required. The primary server can be taken out of 
service with little or no disruption. [Ref. 25] 

ARCserve Replication offers a set of script files that execute automatically at
defined stages during failover. These script files can be used on both the primary server
and the secondary server, before and after failover.

During failover, the secondary server inherits the IP address and the NetBIOS 
name originally owned by the primary. In addition to this inheritance, the primary is 
given a temporary new address and a name to avoid conflicts in the network. The net 
effect of the failover is that in most cases users and applications continue functioning
without interruption or requiring a new login. Some older applications that are not
designed for resilience may require the user to retry a file operation before continuing. 
[Ref. 25] 

When the failed primary server has been repaired, the user runs the reinstate 
wizard. This wizard is not triggered automatically; the user can run it manually or 
schedule it to start at a specific time. The main steps during the reinstating process are as 
follows: 

1. Optionally issue a warning message to logged-in users.

2. Synchronize the primary file system with the secondary so that changes made 
while the secondary was standing in are not lost. 

3. Execute pre-reinstatement scripts on both the primary and the secondary 
server. These scripts might typically be used for closing down services, 
modems, etc. 

4. Restore file shares and IP addresses to the primary server. 

5. Execute post-reinstatement scripts on both primary and secondary servers, 
possibly to start up the services needed for users and applications. 

6. Optionally restart the replication task, for continued protection. 

2. Co-StandbyServer 4.2 for Windows NT

Co-StandbyServer for Windows NT was originally developed by Vinca 
Corporation, which was acquired by Legato Systems (www.legato.com) on August 2, 
1999. Co-StandbyServer is a clustering solution for Windows NT servers. Using
Co-StandbyServer, one can couple two Windows NT servers together to form a cluster. As
shown in Figure 4.1, in a typical configuration two servers are connected with a separate, 
dedicated network segment. 

Co-StandbyServer supports two configuration types: active-active and
active-passive. In an active-active configuration, each server is active on the
network, performing file and print functions and/or acting as an application server.
Conversely, in an active-passive configuration, the second server does not perform any
service on the network until the primary server fails.



[Figure 4.1 labels: Server 1, Server 2, Network, Dedicated Link, Boot Drive (C:/), App. Module, Environment, Mirroring, Mirrored Drives, Data]



Figure 4.1 Typical Co-StandbyServer Configuration. 



During normal operation, a continuous bi-directional mirroring process sends 
data across the dedicated network segment ensuring that each server is kept up-to-date 
with data sets from both servers. Should either server fail, Co-StandbyServer transfers
critical functions from the failed server to the surviving server. This includes IP 
addresses, shares, print queues, server names and applications that were running on the 
failed server. Data that was mirrored from the failed server is now made available to the 
network through the surviving server. At the conclusion of the failover process, all critical
network functions are now active on the surviving server and users can continue to access 
those functions with little or no interruption. 

In Co-StandbyServer, only two servers can exist within a cluster. The two servers 
do not need to be identical. However, the two servers must have the same role in the 
network and must be members of the same domain. 

During the installation process, Co-StandbyServer’s setup installs five services 
and three device drivers. 

Services: 

1. Disk Service: Exports disk devices, which are redirected to the other server 
when a failover occurs. 

2. Event Manager: Receives all errors and alerts for all Co-StandbyServer
services, devices, and drivers. These errors and alerts are logged in the
Application Event Log and a text log.

3. Vinca Service: Monitors communication between the servers and controls the
failover and fallback process.

4. Transport Service: Provides communication service for the dedicated link
between mirroring engines.

5. Watch Service: Watches changes in the Registry for clustered applications in
order to replicate those changes to the other server.

Device Drivers: 

1. VincaFT: Mirroring driver.

2. VNCDisk: Imports the disk devices that have been exported by the
Co-StandbyServer Disk Service on the other server.



3. VNCHint: Provides transport services, which both the VincaFT and VNCDisk 
use to communicate between servers. 

Installation of the Co-StandbyServer components requires installing Windows NT 
on a separate drive, or logical drive within an array. Additionally, each server must have 
three physical drives (or three logical drives configured in a disk array) for an active- 
active configuration, or two drives per server for an active-passive configuration 
(configuration types will be defined later.) 

Each block of data residing on the clustered volume is mirrored to the other disk 
device located on the other server forming a mirrored set. A mirrored set consists of two 
different partitions (on different disk devices and on different servers) logically combined 
to look like a single volume. If the I/O card or disk drives on one server fail, data
access is unaffected because there is still an active I/O card or disk device inside the
mirrored set. This is the same benefit received if one is mirroring two drives internally on
a server and a drive fails; users can still access data from the remaining drive in the
mirrored set. If one of the disk devices in the mirrored set becomes unavailable for any
reason, a delta file is generated that marks the blocks of data that have changed
since the device became unavailable. The process used by the delta file is capable of
tracking as much as 2 GB of changes in one 64 KB block. Therefore, it is unnecessary to
allocate additional drive space for buffering changes when mirroring is prevented. [Ref. 21]
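
The 2 GB and 64 KB figures are consistent with a change bitmap that keeps one bit per disk block: 64 KB holds 524,288 bits, and 2 GB divided by 524,288 is 4 KB per bit. The source does not describe the delta file's internal format, so the bitmap layout and the 4 KB block size in the following sketch are assumptions chosen only because they reproduce the stated numbers.

```python
BITMAP_BYTES = 64 * 1024   # 64 KB delta block, as stated in the text
BLOCK_BYTES = 4 * 1024     # assumed size of one tracked disk block


class DeltaBitmap:
    """One bit per disk block; a set bit means 'changed while mirroring was down'."""

    def __init__(self, bitmap_bytes: int = BITMAP_BYTES):
        self.bits = bytearray(bitmap_bytes)

    def mark_write(self, offset: int, length: int) -> None:
        """Mark every block touched by a write of `length` bytes at `offset`."""
        first = offset // BLOCK_BYTES
        last = (offset + length - 1) // BLOCK_BYTES
        for block in range(first, last + 1):
            self.bits[block // 8] |= 1 << (block % 8)

    def dirty_blocks(self) -> list:
        """Block numbers that must be re-mirrored, in ascending order."""
        return [i * 8 + b
                for i, byte in enumerate(self.bits) if byte
                for b in range(8) if byte & (1 << b)]


# Volume size that one bitmap can cover under the assumed block size.
coverage = BITMAP_BYTES * 8 * BLOCK_BYTES

delta = DeltaBitmap()
delta.mark_write(offset=4096, length=8193)   # touches blocks 1, 2 and 3
```

Because only block numbers are recorded, not the changed data itself, the space needed stays fixed no matter how many writes occur while the other device is unavailable.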

For a cluster server to take over the functionality of a server with mission-critical 
applications, it must have these application files and their support modules along with the 
same registry keys that make up the application. This can be achieved by application 
support scripts or can be done manually. To manually stage a registry, the user must
install the application to the other machine within the cluster using the same directory 
paths that were used when the application was installed on the original server. Command 
files can then be created that start the installed application on the other server when 
failover is processed. [Ref. 22] 

Each server within the cluster maintains its own resources. If any of these 
resources are clustered using Co-StandbyServer, they are associated with an alias NetBIOS
computer name known as a failover group. A failover group is simply a server alias name
container that can move between physical server hosts as needed. The failover group can
be activated on either server within the cluster. Activating a failover group on a server
allows users to see the alias computer name on the network, which points to the server
hosting the clustered resources. [Ref. 21]

Co-StandbyServer provides manual and automatic failover capabilities. Manual 
failover can be employed for scheduled maintenance or for load balancing purposes. In 
automatic failover, Co-StandbyServer sends heartbeat checks across both the client 
network and the dedicated link between the servers (when used). Only a failure of both 
links causes a failure condition. 

When a failure is detected, Co-StandbyServer checks the properties of the 
Failover Groups. If there are any Failover Groups currently active on the failed server 
and they are configured to automatically failover, Co-StandbyServer takes action on the 
surviving server to prepare it to receive the resources of the failed server. The resources 
of the failed server are then activated on the surviving server and the failover process is 
complete. Automatic failover occurs only in the event of a failure of one of the cluster 
servers. By definition, after an automatic failover, the cluster is no longer in a protected
state and the surviving server resources are at risk of another server failure. The 
automatic failover is in contrast to manually moving a failover group, which does not 
alter the availability state of the cluster. [Ref. 22] 

When a server failure causes an automatic failover of a Failover Group, the 
cluster is considered to be in a failed over state because there is only one host server. This 
condition should be repaired as soon as possible in order to return the system to its 
original high-availability state. The necessary steps for recovery depend on what caused 
the server to fail. If a server hosting a failover group fails and the automatic failover 
option is armed, the failover group will be activated on the surviving server. If the 
automatic failover option is not armed, the failover group can be moved manually to the 
surviving server to activate the resources and make them available to users. In most cases 
when the server is repaired and returned to the cluster, it automatically resynchronizes the 
cluster volumes and is available for hosting the failover group. [Ref. 22] 
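
The failover-group behavior described in this subsection can be summarized in a short sketch. It is not Co-StandbyServer code: the class, the function names, and the server and alias names are hypothetical. It models two rules from the text: a failure is declared only when heartbeats are lost on both links, and only groups armed for automatic failover are moved without operator action.

```python
from dataclasses import dataclass


@dataclass
class FailoverGroup:
    alias: str        # NetBIOS alias that clients connect to
    host: str         # server currently hosting the group
    automatic: bool   # True when automatic failover is armed


def server_failed(client_link_ok: bool, dedicated_link_ok: bool) -> bool:
    """A failure is declared only when heartbeats are lost on both links."""
    return not client_link_ok and not dedicated_link_ok


def handle_failure(groups, failed, survivor):
    """Move armed groups to the survivor; return aliases left for manual action."""
    pending = []
    for group in groups:
        if group.host != failed:
            continue
        if group.automatic:
            group.host = survivor
        else:
            pending.append(group.alias)
    return pending


# Hypothetical cluster of two servers, NT1 and NT2.
groups = [
    FailoverGroup("SAAM-SRV", "NT1", automatic=True),
    FailoverGroup("PRINT", "NT1", automatic=False),
    FailoverGroup("FILES", "NT2", automatic=True),
]
pending = []
if server_failed(client_link_ok=False, dedicated_link_ok=False):
    pending = handle_failure(groups, failed="NT1", survivor="NT2")
```

After the call, the armed group is hosted on the survivor, while the unarmed group remains assigned to the failed server until an administrator moves it manually.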

3. Double-Take 3.0 

NSI Software’s (www.nsisw.com) Double-Take is a data replication and failover 
software product. The process begins by identifying the mission-critical data to be 
protected. The machine that holds the original copy of this data is known as the source 
machine. The selected data, known as the replication set, is then copied to another 
computer, known as the target machine. The target machine, on a local network or at a 
remote site, stores the copy of the critical data from the source machine. 

After the target has a copy of the source’s data, Double-Take monitors any 
changes to the data contained in the replication set and sends the changes to the target
machine through a process known as replication. Double-Take replicates only the 
changes rather than copying an entire file. [Ref. 18] The replication process does not 
require a dedicated link between the source and the target machine. 

The failover module resides on the target server, and continually monitors the 
source servers. In the event of a server failure, the target server can assume the names and
IP addresses of the failed servers (in addition to its original name) and invoke scripts to
restart applications.

Double-Take components can be classified in two groups: server components and 
client components: 

Server Components: 

1. Double-Take Service: This service controls the core functionality of Double- 
Take including mirroring and replicating as well as failover functionality for 
the source machine. 

2. Server Monitor Service: This service controls the failover monitoring 
functionality on the target machine. 

3. Logger Service: This service controls the Double-Take logging utility. This 
utility logs alerts (notifications, warnings, and errors) that occur during 
Double-Take processing. 

Client Components: 

1. Management Console: This client is a graphical user interface where you can 
work with all aspects of Double-Take including failover configuration. 

2. Text Client: This client is a full-screen, text-based client, which uses the 
Double-Take Command Language. 



3. Command Line Client: This client is a line-by-line, text-based client that uses 
the Double-Take Command Language. 

4. Failover Control Center: This client is a graphical user interface that can
be used to configure and monitor all aspects of Double-Take failover. 

During the mirroring process, Double-Take transmits the data contained in a
replication set from the source to the target machine so that an identical copy of data
exists on the target machine. All file attributes and permissions are also mirrored to the
target machine. Mirroring must occur initially to generate a baseline copy from the source
to the target. After mirroring has occurred, replication maintains an identical copy of the
data on the target. Figure 4.2 shows the different steps that are completed when a mirror
is performed. The mirroring process includes the following steps:

1. Mirroring is initiated by the user, either manually through one of the
clients or automatically when the connection is created.

2. Double-Take determines which data needs to be sent to the target depending
on the mirroring criteria that were specified through the client. If it is a full
mirror, all of the files are immediately sent to the target. If it is a file
differences mirror, the files contained in the replication set on the source are
compared against the corresponding copy of the replication set on the target to
determine which files need to be mirrored.

3. Double-Take transmits the mirror data to the target machine. 

4. As each packet of mirror data is received on the target, the target returns an 
acknowledgment to the source confirming that the mirrored data has been 
received. 
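
The difference between a full mirror and a file differences mirror (step 2 above) can be sketched as follows. The source does not say how Double-Take compares files; comparing SHA-256 digests is an assumption made here for illustration, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Content fingerprint used to decide whether two files differ."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def files_to_mirror(source: Path, target: Path, full: bool) -> list:
    """Return the relative paths in the replication set that must be sent."""
    to_send = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        dst = target / rel
        # A full mirror sends everything; a differences mirror sends only
        # files that are missing on the target or whose content differs.
        if full or not dst.is_file() or file_digest(src) != file_digest(dst):
            to_send.append(rel.as_posix())
    return to_send
```

A call such as `files_to_mirror(Path("source_dir"), Path("target_dir"), full=False)` returns only the missing or changed files, which is why a differences mirror is preferable over a slow link once a baseline copy already exists.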







Figure 4.2 Double-Take Mirroring. [From Ref. 18] 



Double-Take’s replication process operates at the file system level and is able to 
track file changes independently from the file’s related application. Once the source and 
target have been connected, Double-Take begins tracking file system changes that affect 
the data included in a replication set. During replication, Double-Take immediately 
records these file changes and groups them in packets. The packets are placed on a queue 
corresponding to each connection. Double-Take accumulates packets on the appropriate 
queue until the transmission of the packet to the target has been successful. When the 
target receives the packet, it responds with an acknowledgment and the source removes 
the acknowledged packet from the queue. Figure 4.3 shows the components that are 
involved in the replication process. The replication process includes the following steps:

1. The operating system handles all file requests when an application creates, 
modifies, or deletes data on the source machine. 



2. The file requests are intercepted by the Double-Take driver, DblHook.sys, on 
a Windows NT source machine or by the Double-Take File System (DTFS) 
on a UNIX source machine. 

3. DblHook.sys or DTFS forwards all file requests to the file system and to 
Double-Take. The file system writes the operation to disk on the source 
machine and Double-Take converts the file requests into replication packets. 

4. The Double-Take source sends the replication packets to the Double-Take 
target where they are applied to the target copy of the data. 
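
The queueing behavior described before the steps above, in which a packet stays on the queue until the target acknowledges it, can be sketched as follows. The class, its methods, and the packet fields are hypothetical; the sketch models only the bookkeeping, not Double-Take's packet format or network transport.

```python
from collections import OrderedDict


class ReplicationQueue:
    """Per-connection queue: a packet stays queued until the target acknowledges it."""

    def __init__(self):
        self._pending = OrderedDict()   # sequence number -> packet payload
        self._next_seq = 0

    def enqueue(self, operation: str, path: str, data: bytes = b"") -> int:
        """Record one intercepted file operation and return its sequence number."""
        seq = self._next_seq
        self._pending[seq] = (operation, path, data)
        self._next_seq += 1
        return seq

    def unacknowledged(self) -> list:
        """Packets that still have to be sent or resent, oldest first."""
        return list(self._pending.items())

    def acknowledge(self, seq: int) -> None:
        """Drop a packet once the target confirms it was received."""
        self._pending.pop(seq, None)


queue = ReplicationQueue()
first = queue.enqueue("write", "routes.db", b"\x01\x02")
second = queue.enqueue("delete", "old.log")
queue.acknowledge(first)
remaining = queue.unacknowledged()
```

Removing a packet only on acknowledgment means that a packet lost in transit is still on the queue and can be retransmitted, so the target copy does not silently miss a change.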



[Figure 4.3 labels: Source Machine, File System, Double-Take Service, DTFS or DblHook.sys, Operating System]




Figure 4.3 Double-Take Replication Components. [From Ref. 18] 



Double-Take monitors the status of machines by tracking network requests and 
responses exchanged between monitored source machines and the target machine. The 
target sends a monitor request to each monitored IP address at a user-defined interval. A 
monitor reply is sent from the source back to the target. When the user-defined number of 
missed packets is met, the address is considered “failed.” At this time, the failover 
process occurs or manual intervention is requested. In the event of a failover on a 
Windows NT machine, the target assumes the identity of the failed source including
machine name, IP address, and subnet mask. Failover also sends updates to routers and
other machines to update the IP to MAC address mapping. Network packets and
applications destined for the failed IP address are routed to the target machine.

Depending on the type of client workstations, the timeout settings, and the 
applications in use, the clients may notice only a slight pause while the failover process 
occurs. If the failover timeout is set for a duration such as several minutes, clients may 
see an Abort or Retry message if they try to communicate with the source before the 
timeout has expired and the failover process has completed. For most clients and
network-aware applications, reconnection is automatic. By incorporating user-defined failover
scripts into the process, network administrators can automate many network and 
application events on the target machine, such as starting applications or system services 
or sending network messages to administrators. [Ref. 18] 

Double-Take’s failover capability can be used on the following types of networks: 

• Local Area Network (LAN) 

• Wide Area Network (WAN) with Virtual LAN (VLAN) 

• WAN with WINS or DNS reconfiguration 

• WAN with source servers using secondary IP subnets. 

On a LAN, Double-Take can fail over servers without any additional network
addressing concerns. During failover, the target server assumes the IP address and 
machine name of the failed source server while maintaining its own identity. After the 
target server assumes the identity of the source server, it sends out an Address Resolution 
Protocol (ARP) reply broadcast so that all machines on the LAN will send packets to the 
target server. This reconfiguration can be completely transparent to the clients. [Ref. 19] 
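
To make the ARP announcement concrete, the following sketch builds the bytes of an Ethernet frame carrying an unsolicited (gratuitous) ARP reply that binds the failed source's IP address to the target's network adapter. It follows the standard ARP packet layout; whether Double-Take fills the fields in exactly this way is not stated in the source, and the MAC and IP addresses are made-up examples. The sketch only constructs the frame; actually transmitting it would require a raw socket and administrative privileges.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6


def gratuitous_arp_reply(mac: bytes, ip: str) -> bytes:
    """Build an Ethernet frame with an ARP reply announcing that `ip` is at `mac`."""
    ip_bytes = socket.inet_aton(ip)
    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    ethernet = BROADCAST_MAC + mac + struct.pack("!H", 0x0806)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                        # hardware type: Ethernet
        0x0800,                   # protocol type: IPv4
        6, 4,                     # hardware and protocol address lengths
        2,                        # opcode 2: reply
        mac, ip_bytes,            # sender: the target's MAC, the assumed IP
        BROADCAST_MAC, ip_bytes,  # target fields repeat the announced IP
    )
    return ethernet + arp


# Hypothetical adapter address and assumed source IP.
frame = gratuitous_arp_reply(bytes.fromhex("00a0c9112233"), "192.168.1.10")
```

Machines that receive such a frame replace their cached IP to MAC mapping for the announced address, which is what redirects their traffic to the target server.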



Failover can also happen on a WAN with a VLAN. A VLAN is a group of devices
that logically appear as local to each other, but are separated physically. The routers
and/or switches in a WAN hide the true location of devices in a VLAN from each other.
The routers and switches handle ARP requests to ensure all devices on a VLAN can see
each other as local devices, regardless of the actual network distance between the
devices. In a VLAN, the routers and switches allow Double-Take to fail over across a
WAN. This allows the source and target servers to be on the same logical IP subnet even
though they are physically separate. Figure 4.4 shows IP addressing in a normal VLAN
environment and Figure 4.5 shows what happens during failover across a VLAN. The
CLIENT will still see the server SOURCE as local even if it is on a physically different
network. [Ref. 19]




Figure 4.4 Normal VLAN Operations. [From Ref. 19] 



In a typical WAN environment, an IP address that is valid on one network
segment will not be valid on another segment. In this case Double-Take needs a way to
change the IP address associated with a server name during the failover process. This
allows clients to send packets to the same server name, but to a different IP address.
Changing the IP address associated with a server name can be accomplished with WINS 
or DNS scripting. [Ref. 19] 




Figure 4.5. Failover on a VLAN. [From Ref. 19] 



Another method is to fail over on a WAN with source servers using
secondary IP subnets. Using this method, each source server talks to a router via a unique
secondary address on a router port. When failover occurs, the secondary address is
moved from the router port associated with the source server to the router port associated
with the target server. The routing protocols in use (e.g., RIP, OSPF, IGRP, EIGRP) will
update all routers and let them know that the sub-net is now in a different location. Figure
4.6 shows a sample configuration before failover and Figure 4.7 shows the same
configuration after failover (the sub-net mask used in the entire configuration is
255.255.255.0). [Ref. 19]



Figure 4.6. WAN Source Server Using Secondary IP (before failover.) [From Ref. 19] 




[Figure 4.7 diagram; legible label: Frame Relay Network]




Figure 4.7. WAN Source Server Using Secondary IP (after failover.) [From Ref. 19] 



After the problems of the failed server are corrected, the network administrator
manually initiates a process called fallback. Fallback refers to reinstating the
repaired source server in the network. In order to avoid conflicts, the source machine
should not be reattached to the network until Double-Take has completely removed the 
source’s identity from the target. Depending on the type of machine and data that 
Double-Take is protecting, fallback may need to be scheduled for an inactive period. If 
failover is being used in conjunction with Double-Take replication or if a drive on the 
source was replaced, the data on the source may not be current. It may be necessary to 
restore the most recent data from the target machine to the proper location on the source 
before initiating the fallback process and bringing the source back online. 

Users may notice an interruption at their workstations during fallback. This
interruption lasts from the completion of the fallback process until the source machine
is back online. As with failover, network administrators can incorporate
user-defined fallback scripts into the process to automate many network and application
events on the target machine, such as starting applications or system services/daemons or
sending network messages to administrators. [Ref. 18]

Double-Take can be configured in various forms. As shown in Figure 4.8, sample 
configurations for Double-Take are as follows: 

• One-to-One, Active/Standby: One target machine, having no production 
activity, is dedicated to support one source machine. The source is the only 
machine actively replicating data. This configuration is appropriate for offsite 
disaster recovery, failover, and critical data backup. 



• One-to-One, Active/Active: Each machine acts as both a source and target 
actively replicating data to each other. This configuration is appropriate for 
failover and critical data backup and is more cost-effective than the 
Active/Standby configuration because there is no need to buy a dedicated 
target machine for each source. 

• Many-to-One: Many source machines are protected by one target machine. 
This configuration is appropriate for offsite disaster recovery. Many-to-one 
configuration is also an excellent choice for providing centralized tape backup 
because it spreads the cost of one target machine among many source 
machines. 

• One-to-Many: One source machine sends data to multiple target machines.
The target machines may or may not be accessible by one another. This 
configuration provides offsite disaster recovery, redundant backups, and data 
distribution. For instance, this configuration can replicate all data to a local 
target machine and separately replicate a subset of the mission-critical data to 
an offsite disaster recovery machine. 

• Chained: One or more source machines send replicated data to a target
machine that in turn acts as a source machine and sends selected data to a final 
target machine which is often offsite. This is a convenient approach for 
integrating local high availability with offsite disaster recovery. This 
configuration moves the processing burden of WAN communications from 
the source machines to the target machines. 



Figure 4.8. Double-Take Configuration Options. 

4. Endurance 4000 

The Endurance 4000 is developed by Marathon Technologies Corporation
(www.marathontechnologies.com). It aims at providing 99.999% availability for the
Windows NT server. Instead of the failover-based approach, Endurance 4000 uses
hardware redundancy and duplicates execution and processing on multiple systems at the
same time to make system failure transparent to application users. In Endurance 4000,
hardware redundancy is achieved by combining four PCs into one fault-tolerant server
as shown in Figure 4.9. The PCs are grouped into two tuples as shown in Figure 4.10.
Each tuple consists of a Compute Element (CE) and an Input/Output Processor (IOP) 
connected together. In each tuple, one PC does the computation and the other one processes
the I/O operations.
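
To put the 99.999% availability target in perspective, the permitted downtime can be computed directly. The calculation below is plain arithmetic rather than a vendor figure, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Maximum yearly downtime permitted by a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


five_nines = downtime_minutes_per_year(0.99999)   # about 5.26 minutes
three_nines = downtime_minutes_per_year(0.999)    # about 525.6 minutes
```

An availability of 99.999% therefore leaves a little over five minutes of downtime per year, compared with almost nine hours at 99.9%.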




Figure 4.9. Configuration of the Endurance 4000. 

Endurance 4000 consists of the following components: four PCI-based Marathon
Interface Cards (MICs) and interconnect cables, two SplitSite Data Links (SSDLs) and 
an Endurance software CD. There are four basic concepts underlying the Endurance 
technology that supplies and manages the necessary redundancy: 

1. Division of labor: the user’s computing tasks are logically and physically 
separated from I/O activity. 

2. System redundancy: the system is configured redundantly, providing
significant availability and data integrity. 



3. Marathon software: the system performs tasks such as error checking, fault 
isolation, synchronization, and system management. 

4. SplitSite capability: a portion of the Endurance system can be placed in a
different geographic location, providing an instant “hot site” should one of the 
sites be rendered inoperable due to a disaster. 

A Marathon Interface Card (MIC) is used to connect the systems together and to 
perform tasks needed to support fault tolerance and disaster tolerance. A network-like 
I/O redirector is used to redirect all I/O from the CEs to the IOPs. All operating system 
and application I/O calls are redirected to the IOPs for processing. 

Both CE and IOP consist of standard Intel-based systems running the Microsoft 
Windows NT operating system. The CE is dedicated to running all application and 
operating system software, and has no I/O devices such as disks or LAN cards. The IOP 
is dedicated to performing all I/O device operations. Because the IOPs accommodate all 
of the system’s mass storage and I/O, the CEs only need to contain a MIC, a high speed 
CPU, and the memory needed to run the operating system and user’s applications. The 
CE’s failure rate is much lower than a fully configured system because the components 
that fail most frequently, disk drives and network cards, are not present. 

A tuple forms a single logical server, but actually consists of two servers, each 
running the Microsoft Windows NT operating system. With its MIC-based interconnect 
technology, the Endurance product provides the ability to use a single VGA display, 
keyboard, and mouse, which can be switched between the CE and IOP. [Ref. 23] 



[Figure 4.10 diagram; legible label: Compute Element]




Figure 4.10. Marathon Tuple. [From Ref. 23] 



In an Endurance 4000, system redundancy is provided by both I/O and
compute processing redundancy. I/O redundancy is implemented by using two I/O
Processors (as shown in Figure 4.11). I/O to a disk used by the software running on the
CEs occurs simultaneously on two disks, one on each IOP, ensuring that two copies of
the data are always available. This configuration forms a RAID Level-1 type storage
system without the need for a special RAID controller.
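
The mirrored-write behavior can be illustrated with a toy model in which each disk is a dictionary of blocks. This is a minimal sketch of RAID Level-1 semantics in general, not of Marathon's I/O software; the class and its methods are hypothetical.

```python
class MirroredDisk:
    """Writes go to every healthy replica; reads are served by any healthy one."""

    def __init__(self, replicas: int = 2):
        self.disks = [dict() for _ in range(replicas)]   # block number -> data
        self.healthy = [True] * replicas

    def write(self, block: int, data: bytes) -> None:
        for disk, ok in zip(self.disks, self.healthy):
            if ok:
                disk[block] = data

    def read(self, block: int) -> bytes:
        for disk, ok in zip(self.disks, self.healthy):
            if ok:
                return disk[block]
        raise IOError("no healthy replica left")

    def fail(self, index: int) -> None:
        """Simulate the loss of one replica (for example, the disk on one IOP)."""
        self.healthy[index] = False


volume = MirroredDisk()
volume.write(7, b"routing table")
volume.fail(0)                   # lose the disk attached to the first IOP
survivor_copy = volume.read(7)   # still served by the second IOP's disk
```

Because every write reaches both replicas before a failure, the read after the failure returns the same data from the surviving disk. Re-mirroring a repaired disk is not modeled here.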

Similarly, there are two network interface cards, one on each IOP, to support 
application network I/O for applications running on the CE. This provides the 
redundancy required to guarantee network access to the applications running in the CE, in 
the presence of a failure of a network card. The 100BaseT interconnect between the two
IOPs provides a path for communications between the IOPs to isolate and manage
failures. This path is also used to re-mirror a failed disk or system after repair.
Re-mirroring occurs as a background task and is invisible to the user. [Ref. 23]



[Figure 4.11 diagram; legible labels: Compute Element (CE), Compute Element (CE)]




Figure 4.11. Endurance 4000 Array. [From Ref. 23] 

The Compute Elements run in lockstep so that the failure of one CE is completely 
invisible to the end user as the remaining CE continues to compute through the failure. 
Also, subsystems can be repaired or upgraded while the system is online. 
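
The idea of computing through a failure can be illustrated with a toy model in which two replicas receive identical inputs and their results are compared after every step. This is a conceptual sketch only: the actual lockstep mechanism relies on the MIC hardware and Marathon's synchronization software, and the function names and the fault-injection parameter here are hypothetical.

```python
def run_lockstep(step, inputs, fail_at=None):
    """Feed identical inputs to two replicas and compare results after each step.

    fail_at is a hypothetical fault injection: (replica index, step index).
    """
    state = [0, 0]          # one state value per Compute Element
    alive = [True, True]
    outputs = []
    for i, value in enumerate(inputs):
        if fail_at is not None and fail_at[1] == i:
            alive[fail_at[0]] = False
        results = [step(state[r], value) for r in range(2) if alive[r]]
        if not results:
            raise RuntimeError("both replicas failed")
        if len(set(results)) > 1:
            raise RuntimeError("replicas diverged at step %d" % i)
        for r in range(2):
            if alive[r]:
                state[r] = results[0]
        outputs.append(results[0])
    return outputs


def accumulate(total, value):
    """Example computation: a running sum."""
    return total + value


healthy = run_lockstep(accumulate, [1, 2, 3])
degraded = run_lockstep(accumulate, [1, 2, 3], fail_at=(0, 1))
```

Both runs produce the same sequence of outputs, which is the sense in which the failure of one replica is invisible to the consumer of the results.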

Marathon software consists of the following components: I/O handlers,
monitor, synchronization, and Marathon System Manager. The I/O handlers manage all
the data movement between the hardware components of a Marathon system. They 
intercept all I/O requests in the CE and forward them to the IOPs, as well as receive the 
I/O requests from the CE and send them for processing to Windows NT drivers in the 
IOPs. All of these operations are done through the MIC, which also checks and 
compares data. [Ref. 23] 



The Monitor runs in the IOPs. It manages the flow of data through the IOPs and
coordinates the activities of the IOPs. It removes a subsystem when faults occur
and, when a subsystem is repaired, the Monitor rejoins the repaired subsystem
to the Endurance configuration. [Ref. 23]

The fault handling software detects and isolates faults using the following two 
techniques. First, the software handles hard failures that can be isolated using timers and 
error detectors. Second, the software follows a rule-based system that uses past history 
and a set of rules to identify failures that are more difficult to isolate. 

Marathon’s synchronization software runs in the CEs allowing both CEs to 
function in lockstep. If one CE fails, the synchronization software removes the failed CE 
from the Endurance system and enables the other CE to maintain complete context and 
functionality of the operating system and all applications. After the failed CE is repaired, 
the CEs return to lockstep using software synchronization. [Ref. 23] Although any failed 
component can be repaired without interrupting the service, software running on the 
Compute Elements cannot be upgraded while the server is running. Scheduled down time 
is required to enable complete upgrades. 

The Marathon System Manager (MSM) is a GUI-based management tool for the
Endurance system. Using MSM, a user can manage, monitor, and configure the system
remotely or from the local Endurance system; that is, MSM can run on each CE or IOP and
provide a view of the entire Endurance configuration as seen by each CE and IOP.
MSM gives visibility into the current status of an Endurance system and allows the system
manager to monitor and correct failures. [Ref. 23]






Marathon’s SplitSite technology allows Endurance tuples to be placed in different 
geographic locations, providing the user with an instant “hot site” should a disaster occur. 
As shown in Figure 4.11, the two system tuples can be separated up to 1.5 kilometers 
with Marathon’s SplitSite link. [Ref. 23] 

Marathon’s architecture not only provides protection against hardware faults but 
also provides protection against transient software faults. It can also detect operating 
system failures and automatically restart the system. The I/O Processors (IOPs) run 
Marathon’s I/O management and the fault handling software, leaving this portion of the 
system isolated from the loads placed on the Compute Element (CE) by the user’s 
applications and operating system. 

Further, since the IOPs handle all interrupts, the CE (where the operating system
and user applications are running) is not subjected to the usual stream of interrupt
asynchrony. Here, interrupts are managed through a structured process that eliminates a
major source of asynchrony-induced software failures. Although the IOPs are subjected 
to these asynchronies, since there are two autonomous IOPs in a full fault tolerant system, 
an interrupt-induced software asynchrony will likely only affect one of the IOPs. [Ref. 
23] 

This isolation and the structured nature of the I/O software environment in both
the CEs and the IOPs mask transient software faults, thus delivering a level of operating
system software fault tolerance unavailable in any other system. Although masking
transient software faults does provide more protection from software faults than has been 
previously available, the likelihood that an operating system software bug could cause an 
interruption of service is not eliminated. Since transient software faults account for only 






75% of bugs found in typical commercial operating systems, there are still some bugs
that may cause the CE operating system to crash, delivering the now famous Windows
NT “blue screen of death” to the user. In this event, the only option is to reboot the
system.

Since the I/O Processors are independent processors that have full visibility of the 
activities in the CEs as well as control over them, they can be programmed to reboot the 
operating system running on the CE. At the user’s option, the Marathon software can be 
configured to reboot the operating system automatically, making the system available to 
the users until the same software bug shows up again. It should be noted that the 
Marathon architecture also offers the IOPs the opportunity to observe the operation of the 
applications running on the CEs, and to restart a “hung” application. [Ref. 23] 

5. Octopus 3.2 

Octopus was originally developed by FullTime Software Corporation, which was
acquired by Legato Systems (www.legato.com) on April 1, 1999. Octopus is a software-only
solution for Windows NT that provides data mirroring and failover capability. Its
aim is to increase both the availability and the reliability of Windows NT servers. 
Octopus operates over both Local Area Network (LAN) and Wide Area Network (WAN) 
connections and allows remote administration and installation. Users install Octopus on 
each Windows NT machine that will be used as a source and/or target. Octopus runs over 
any network interface supported by Windows NT and does not require a dedicated NIC. 
However, dedicated NICs can also be used to allow users the option of keeping Octopus- 
related traffic off their networks. 






Octopus provides the following software components: 

1. Octopus Client: This is the graphical user interface (GUI) of the product. It is
used to configure Octopus, start Octopus services, and administer
Octopus. The Client provides capabilities for remote administration of
Octopus.

2. Octopus Service: This component provides capabilities for switch-over and 
data mirroring. This component runs as a service in the Windows NT server. 

3. Octopus Device: This component works with the Octopus Service to provide 
capabilities for data mirroring. 

4. Octopus SETUP and UNINSTALL programs: These programs are provided 
for installing and removing Octopus on supported Windows NT platforms. 

5. SNMP Agent Extension DLL: Extends the standard Windows SNMP service to 
allow Octopus to send its messages as Simple Network Management Protocol 
(SNMP) events, allowing Octopus to work with systems management 
software. 

6. Performance Monitor DLL: Extends the Windows NT Performance Monitor 
to provide performance statistics on Octopus in the Windows NT Performance 
Monitor. 

Octopus protects data through three functions: (1) Mirroring, which captures
changes in data at the source system; (2) Forwarding, which sends changes in data from
the source system to the target system; and (3) Updating, which applies changes as stored
in the receive log on the target system to the files on the target system.






Octopus updates target systems with changes in data as they occur, rather than re- 
sending all of the data at once. On each source machine, the user specifies which drives, 
directories and/or files they wish to replicate and the target system where the replicated 
files will reside. As shown in Figure 4.12, each time a change to a specified file is 
committed to disk on the source machine, Octopus mirrors it to the Octopus send log. 
Octopus then uses any Windows NT supported protocol to forward the change across the 
network to the Octopus receive log on the target machine. Finally, Octopus writes the 
change to the appropriate file on the target drive. [Ref. 20] 

Octopus can mirror data in one-to-one, one-to-many, many-to-one, or many-to-many
configurations. As a result of these configuration options, it can be used as a data
transport mechanism for data distribution and localization systems, in addition to data
protection for applications. Conversely, applications that require collecting data at remote
locations and continuously forwarding it to a centralized site can employ Octopus’s
many-to-one replication configuration. [Ref. 20]




Figure 4.12. Data Protection Operation in Octopus. [From Ref. 20] 

Octopus offers two kinds of switch-over (failover) capability: Automatic Switch-Over
(ASO) and Super Automatic Switch-Over (Super ASO). The difference between
ASO and Super ASO is that Super ASO provides three additional capabilities [Ref. 20]:
First, the target system can maintain its original identity while also assuming the identity 
of the failed source system, so that any clients using the target do not have their services 
interrupted. Second, the target system can simultaneously assume the identity of an 
unlimited number of source systems. Third, forwarding can continue after a Super ASO 
switch-over; with ASO, forwarding must cease since the original name of the target 
system “disappears” from the network. 

On the source machine, users configure a heartbeat between the source and target
machines. The heartbeat configuration specifies how often the source system sends “I'm alive”
messages and how long the target system should wait after seeing the last “I'm alive”
message before initiating the switch-over process. If the target system does not receive an
“I'm alive” message from the source system within the specified time, it checks the
Windows NT registry and service database for the source machine. If Windows NT can
find the source machine on the network, the target machine resumes monitoring the “I'm
alive” messages. If not, the target machine initiates the switch-over sequence. In larger
networks, it may take longer to search the Windows NT registry and service database for
the source machine.
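The decision logic described above can be sketched in a few lines of Java. This is only an illustration of the described behavior, not Octopus code: the class name SwitchOverMonitor, its methods, and the sourceStillVisible callback (a stand-in for the registry and service database lookup) are all hypothetical, and restarting the wait window after a successful lookup is an assumption.

```java
import java.util.function.BooleanSupplier;

// Illustrative sketch only; all names here are hypothetical and are not
// part of the Octopus product.
final class SwitchOverMonitor {
    enum Action { KEEP_MONITORING, START_SWITCH_OVER }

    private final long timeoutMillis;                // wait allowed after the last "I'm alive"
    private final BooleanSupplier sourceStillVisible; // stand-in for the registry/service lookup
    private long lastHeartbeatMillis;

    SwitchOverMonitor(long timeoutMillis, long startMillis, BooleanSupplier sourceStillVisible) {
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeatMillis = startMillis;
        this.sourceStillVisible = sourceStillVisible;
    }

    // Called whenever an "I'm alive" message arrives from the source.
    void onHeartbeat(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    // Called periodically by the target to decide what to do next.
    Action check(long nowMillis) {
        if (nowMillis - lastHeartbeatMillis <= timeoutMillis) {
            return Action.KEEP_MONITORING;
        }
        if (sourceStillVisible.getAsBoolean()) {
            // Assumption: finding the source on the network restarts the wait window.
            lastHeartbeatMillis = nowMillis;
            return Action.KEEP_MONITORING;
        }
        return Action.START_SWITCH_OVER;
    }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic; a real monitor would read the system clock and run the check on a timer.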

During the switch-over process, the target machine assumes the machine name 
and, if specified by the user, the IP address(es) of the source machine. Users can identify 
services and/or applications to stop or start before or after the switch-over process has 
completed. After switch-over, the target can maintain its own identity as well as the 
identities of failed source machines. Users on the network can continue to work unaware 
that their server has failed and that the target machine has taken over. [Ref. 20]






B. RECOMMENDATIONS 



MSCS, WLBS and third-party products presented above all try to tolerate server 
failures and provide highly available services to clients within the Windows NT server 
environment. In selecting one of these products for the local area fault tolerance of the 
SAAM server, our main concern is to find out which one best meets the SAAM server 
fault tolerance requirements. 

MSCS provides clustering solutions only to applications that are MSCS-aware. In 
other words, in order to benefit from MSCS, the application to be protected must be 
written using the specific API. Moreover, in an MSCS cluster, the two servers can 
typically be no further apart than allowed by the shared SCSI bus. The maximum 
distance is about 25 meters, mandating that the location of the standby server be in the 
same room. If there is a fire or a power outage, it is likely the whole cluster will be shut 
down. 

Furthermore, Microsoft claims that MSCS recovers from a server failure in 
around one minute. However, the actual time needed for the recovery depends on the 
application type. Some users of MSCS complain that the failover process is too long for 
some applications, in some cases taking more than 30 minutes. [Ref. 29] Because of these 
drawbacks, MSCS does not satisfy the SAAM server fault tolerance requirements. 

WLBS is designed for high availability and scalability of TCP/IP-based services
such as Web servers, streaming media, Virtual Private Networking (VPN), and proxy
services, which are generally considered to be “stateless.” However, the SAAM server provides
stateful services to routers. In other words, the service provided to routers by the SAAM
server depends entirely on the data residing in the PIB, which is updated frequently.
Therefore, WLBS cannot be considered as a solution for the local area fault tolerance of
the SAAM server.

Since the products bundled with the Windows NT Server Enterprise Edition could 
not meet the local area fault tolerance requirements of the SAAM server, a solution must 
be selected from the third party products. In order to compare and contrast these products, 
we classify them into two groups according to their approaches used for tolerating server 
failures. Among these third party products, ARCserve Replication, Co-StandbyServer,
Double-Take, and Octopus use the failover approach, whereas Endurance 4000 uses the
hardware redundancy approach for tolerating server failures.

Specifications of the third party products that use the failover approach are 
summarized in Table 4.1. Considering the failover times of these products, it is apparent 
that the offered failover time is between 30 and 45 seconds, which is too long for the 
SAAM server. Consequently none of these products are qualified as a solution for the 
local area fault tolerance for the SAAM server. 

On the other hand, by using hardware redundancy Endurance 4000 can tolerate 
server failures in less than a second. Marathon’s Endurance 4000 product allows industry 
standard Intel based PC systems to be configured as fault tolerant servers. The Endurance 
4000 server runs the standard Microsoft Windows NT operating system and applications. 
This means absolutely no modifications, scripts, or APIs are required for SAAM Server 
applications. 

Using SplitSite technology, the PC systems connected by Endurance 4000 can be 
placed at different locations up to 1.5 kilometers apart, while they operate and appear to 






users as a single fault tolerant server. This provides continuity of service in the face of 
localized disasters that are confined to one building. 

For these reasons, especially its ability to recover from server failures in 
milliseconds, we believe that Endurance 4000 best meets the criteria for the local area 
fault tolerance for the SAAM server. The main drawback of the Endurance 4000 is its 
price. Endurance 4000 costs $25,000, and the price does not include the four servers and 
multiple copies of Windows NT software. Although it is expensive, the price is justified 
when compared to the large cost of routers, switches, and networking software, and the 
amount of revenue at stake. Consequently, among all products discussed thus far, the 
Endurance 4000 is recommended as a solution for the local area fault tolerance for the 
SAAM server. 








| Feature | ARCserve Replication 4.0 | Co-StandbyServer 4.2 | Double-Take 3.0 | Octopus 3.2 |
|---|---|---|---|---|
| Price (1) | $2995 | $4499 | $3750 | $2998 |
| Built-in replication | Yes | Yes | Yes | Yes |
| Replication file updates only | Yes | No (2) | Yes | Yes |
| File/directory level selection | Yes | No | Yes | Yes |
| Replication before write-through to disk | Yes | No | Yes | Yes |
| Open file mirroring and replication | Yes | Yes | Yes | Yes |
| Limitation of bandwidth usage on the network | No | No | Yes | Yes |
| One-to-many replication | Yes | No | Yes | Yes |
| One-to-many failover | No | No | Yes | Yes |
| Allows dissimilar hardware and drive configurations | Yes | No | Yes | Yes |
| Allows custom scripts or batch files | Yes | Yes | Yes | Yes |
| System requirements | Intel x86 or better; 6MB of disk space; 32MB of RAM; one NIC per server | Intel x86 or better; 30MB of disk space; 32MB of RAM; 3 physical hard disks (active/active); 2 physical hard disks (active/passive); one NIC per server (two is recommended) | Intel x86 or better; 40MB of disk space; 16MB of RAM; one NIC per server | Intel x86 or better; 15MB of disk space; additional free disk space on target machines equal to size of replicated files plus 10%; 32MB of RAM; one or more NICs per server |
| Failover time (3) | 45 seconds [Ref. 30] | 30 seconds [Ref. 29] | 45 seconds [Ref. 30] | 30 seconds [Ref. 31] |
| Suitability for SAAM (local area fault tolerance) | No | No | No | No |



Table 4.1. Specifications of Products Implementing the Failover Approach.



1 Prices are for a two-server configuration and are based on [Ref. 30].

2 Entire disk block must be transmitted 

3 Failover time is based on the SQL server failover 







V. REMOTE AREA FAULT TOLERANCE FOR SAAM SERVER 

This chapter focuses on the remote area fault tolerance for the SAAM server and
consists of three main sections. In the first section, an overview of the designed
model is presented first. After that, the details of the designed model are discussed according
to the fault tolerance phases explained in Section II.D. In the second section, the
integration of the model with the existing SAAM server source code is explained. 
Finally, in the third section, testing of the implementation is discussed. First, the testbed 
is explained, and then the test results are presented. 

A. MODELING 

Remote area fault tolerance of the SAAM server is provided primarily using the 
redundancy approach. Specifically, a redundant backup server is used (see Figure 5.1). 
Routers are required to send updates to both the primary and the backup servers. The 
backup server maintains its own PIB in parallel with the primary server. However, the 
backup SAAM server does not respond to routers’ requests until it detects a failure of the
primary server. 







1. Server States



For the proposed primary-and-secondary approach, the states of the SAAM server
and the state transitions are identified and illustrated in Figure 5.2. Both the primary and 
the backup servers begin their operation from the initial state. The initial state refers to 
the state of the server node* prior to the server software agent installation. The server 
software agent is a mobile Java class file sent by the system or network administrator, 
and installed on the server node. With the server software agent, the server node can be 
initialized either as a primary server or backup server. 

Upon installation of the primary server software agent, a server node becomes the 
primary server and enters an active running state. In the active running state, the primary 
server provides services to routers such as flow routing table entry updates and flow 
responses. 

Upon installation of the backup server software agent, a server node becomes the 
backup server and enters a silent running state. In the silent running state, the backup 
server maintains its PIB the same way as the primary server, but does not respond to
routers’ requests. Additionally, the backup server continuously monitors the health of the 
primary server. The mechanism for such monitoring is described in Section 2. When the 
primary server fails, it effectively enters a failed state. When the backup server detects a 
failure, it enters the active running state and takes over the functionalities of the primary 
server. While the primary server is running, if the backup server fails, then the backup 
server enters the failed state. 



* Server node is a host on the network that will serve either as a primary or as a backup SAAM server. 




The failed primary server stays in the failed state until it is corrected. After repair, 
the administrator may choose to reinstate the repaired server to the network either as a 
primary or as a backup. If the administrator wants to reinstate the repaired server as the 
backup, then the repaired server enters the silent running state. However, if the 
administrator wants to reinstate the repaired server as the primary, then the repaired 
server enters a PIB synchronization state. 




Figure 5.2. State Transition Diagram of a SAAM Server. 






In the PIB synchronization state, the repaired server rebuilds its PIB from the link 
state advertisement (LSA) messages received from the routers. After the PIB 
synchronization is completed, the server sends a “primary server id” message to the 
currently active primary server. Then, the repaired server enters the active running state, 
while the backup server returns to the silent running state. 
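The states and transitions described in this section can be captured in a small transition function. The following Java sketch is only one possible reading of the text: the enum constants, the event names, and the class SaamServerStateMachine are hypothetical labels chosen for illustration and do not come from the SAAM source code.

```java
// Hypothetical names throughout; this mirrors the prose description only.
enum ServerState { INITIAL, ACTIVE_RUNNING, SILENT_RUNNING, FAILED, PIB_SYNCHRONIZATION }

enum ServerEvent {
    INSTALL_PRIMARY_AGENT, INSTALL_BACKUP_AGENT, FAILURE, PRIMARY_FAILURE_DETECTED,
    REINSTATE_AS_BACKUP, REINSTATE_AS_PRIMARY, PIB_SYNC_COMPLETE, PRIMARY_SERVER_ID_RECEIVED
}

final class SaamServerStateMachine {
    // Returns the next state, or throws if the text describes no such transition.
    static ServerState next(ServerState state, ServerEvent event) {
        switch (state) {
            case INITIAL:
                if (event == ServerEvent.INSTALL_PRIMARY_AGENT) return ServerState.ACTIVE_RUNNING;
                if (event == ServerEvent.INSTALL_BACKUP_AGENT) return ServerState.SILENT_RUNNING;
                break;
            case ACTIVE_RUNNING:
                if (event == ServerEvent.FAILURE) return ServerState.FAILED;
                // The acting primary steps back when the repaired primary announces itself.
                if (event == ServerEvent.PRIMARY_SERVER_ID_RECEIVED) return ServerState.SILENT_RUNNING;
                break;
            case SILENT_RUNNING:
                if (event == ServerEvent.FAILURE) return ServerState.FAILED;
                if (event == ServerEvent.PRIMARY_FAILURE_DETECTED) return ServerState.ACTIVE_RUNNING;
                break;
            case FAILED:
                if (event == ServerEvent.REINSTATE_AS_BACKUP) return ServerState.SILENT_RUNNING;
                if (event == ServerEvent.REINSTATE_AS_PRIMARY) return ServerState.PIB_SYNCHRONIZATION;
                break;
            case PIB_SYNCHRONIZATION:
                if (event == ServerEvent.PIB_SYNC_COMPLETE) return ServerState.ACTIVE_RUNNING;
                break;
            default:
                break;
        }
        throw new IllegalStateException("No transition from " + state + " on " + event);
    }
}
```

Rejecting undescribed transitions with an exception is a design choice of the sketch; it makes gaps in the state diagram visible during testing rather than silently ignoring them.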

There are four major phases in our design for remote area fault tolerance of the
SAAM server: failure detection, damage confinement and assessment, error correction,
and fault treatment and continued service.

2. Failure Detection 

In most distributed systems, failure of a system component is detected by
implementing a periodic message exchange mechanism among the system components.
This type of message exchange mechanism is called a heartbeat protocol. Specifically, a
working component must periodically emit “beat” messages to show that it is operating
properly. If the component fails to emit a beat message within the timeout period, then its
failure is detected. Since heartbeat protocols check timing-related constraints of a system,
they can be considered an example of timing checks, discussed in Chapter II.

Many different uses of heartbeat protocols are reported in the literature. For
example, they are used in process termination in distributed programs (if a process in a
program terminates or fails, then the remaining processes in the program terminate) [Ref.
26], network protocols [Ref. 27], reaching agreement on processor-group membership
[Ref. 33], and mobile computing [Ref. 34]. In the model designed in this thesis, a
heartbeat protocol is also used for the SAAM server failure detection.






When designing a heartbeat protocol, the protocol designer should strive to 
incorporate the following three ideal characteristics into his design [Ref. 26]: 

1. The rate at which heartbeat messages are sent in the protocol should be small 
in order to reduce protocol overhead. 

2. The detection delay (time difference between the detection time and the actual 
failure time) should be small, in order to improve protocol responsiveness. 

3. The probability of false detection should be small, in order to increase 
protocol reliability. 

In any heartbeat protocol implementation, it is impossible to incorporate all these 
three objectives at the same time, because they are somewhat contradictory. For example, 
to reduce both the rate of sending heartbeat messages and the detection delay, the 
protocol should allow only a small number of missed heartbeat messages. In this case, 
probability of false detection would increase. Therefore, every heartbeat protocol is a 
compromise between these contradictory objectives [Ref. 26]. 

Remote area fault tolerance for the SAAM server is mainly focused on tolerating 
environmental faults such as fire, earthquake, and flood, which cause unrecoverable 
server failures. Therefore, it is essential to locate the primary and the backup SAAM
servers as far apart as possible in the SAAM region, as shown in Figure 5.1. In
heartbeat protocol implementations, it is usually preferable to use a dedicated link 
between the two servers. By using such a dedicated link, the protocol overhead 
introduced to the network can be avoided. However, in SAAM, using a dedicated link 
between two servers is not practical, because of the long distance between the two 






SAAM servers. Therefore, the heartbeat protocol to be implemented will use the existing 
network links for communications between the two servers. 

In order to select the best heartbeat protocol for detecting SAAM server
failures, two different heartbeat protocols that cover most of the design space, the constant
heartbeat protocol and the accelerated heartbeat protocol, are prototyped and their
performance results are compared. The following sections discuss these two
heartbeat protocols, their implementations, and their performance results.

a. Constant Heartbeat Protocol (V.0)

The constant heartbeat protocol is the first protocol that we prototyped and
evaluated. Therefore, it was given version number zero (V.0). Additionally, due to the
constant rate of heartbeat messages generated by this protocol, it is called the “constant
heartbeat protocol.”

In the constant heartbeat protocol, the primary SAAM server periodically 
sends heartbeat messages to the backup SAAM server, indicating that it is in an 
operational state. On the other hand, the backup SAAM server listens only to these 
heartbeat messages coming from the primary server. If the backup SAAM server misses a 
predetermined number of consecutive messages, then it declares the failure of the 
primary SAAM server. 

Let the time interval for sending heartbeat messages from the primary 
SAAM server to the backup SAAM server be t, and the maximum number of 
consecutive message misses allowed (allowed-miss) be n . The following equation relates 
t , n , and the detection delay denoted by d : 






$$(t \cdot n) < d < (t \cdot (n + 1)) \tag{5.1}$$

Figure 5.3 illustrates this relationship. In this case, n is set to two.
According to the figure, after the first two heartbeat messages, the backup SAAM server
did not receive any heartbeat messages. Since n is equal to two, after three consecutive
misses, the backup SAAM server declares the failure of the primary SAAM server. The
actual failure, if there is one, must have occurred at some time during the second interval.
Therefore, the vertical arrows labeled d_max and d_min show the possible maximum and
minimum detection delay times, respectively. In other words, the detection delay d should
be less than 3t but greater than 2t.
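The allowed-miss rule and the bounds of Equation 5.1 can be checked with a short Java sketch. The class ConstantHeartbeatDetector and its method names are hypothetical and are not the prototype classes from the appendices; the sketch assumes the backup is told once per interval whether a heartbeat arrived.

```java
// Hypothetical helper, not part of the thesis prototype.
final class ConstantHeartbeatDetector {
    private final double intervalSeconds; // t in Equation 5.1
    private final int allowedMiss;        // n in Equation 5.1
    private int consecutiveMisses;

    ConstantHeartbeatDetector(double intervalSeconds, int allowedMiss) {
        if (intervalSeconds <= 0 || allowedMiss < 0) {
            throw new IllegalArgumentException("interval must be positive and allowed-miss non-negative");
        }
        this.intervalSeconds = intervalSeconds;
        this.allowedMiss = allowedMiss;
    }

    // A received heartbeat clears the miss count.
    void onHeartbeatReceived() {
        consecutiveMisses = 0;
    }

    // Records one interval without a heartbeat; returns true once the
    // number of consecutive misses exceeds the allowed-miss value.
    boolean onIntervalMissed() {
        consecutiveMisses++;
        return consecutiveMisses > allowedMiss;
    }

    // Lower bound of Equation 5.1: t * n.
    double minDetectionDelay() {
        return intervalSeconds * allowedMiss;
    }

    // Upper bound of Equation 5.1: t * (n + 1).
    double maxDetectionDelay() {
        return intervalSeconds * (allowedMiss + 1);
    }
}
```

With t = 1 second and n = 2, the first two missed intervals are tolerated, the third declares the failure, and the bounds evaluate to 2 and 3 seconds, matching the 2t and 3t limits stated above.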



[Figure 5.3 (image not reproduced): timeline between the primary SAAM server and the backup SAAM server for n = 2, showing four intervals of length t. Two heartbeat messages are received, followed by three that are not received (allowed #1, allowed #2, and not allowed), at which point the failure is detected. The arrows d_max and d_min mark the maximum and minimum detection delays.]



Figure 5.3. The Detection Delay in the Constant Heartbeat Protocol. 






b. Accelerated Heartbeat Protocol (V.1)

The concept of accelerated heartbeat protocol was introduced by Gouda
and McGuire in 1998 [Ref. 26]. They use such protocols as a process termination
mechanism in distributed programs to ensure that if a process in a program terminates or
fails, then the remaining processes in the program also terminate. The same concept is
adapted to the SAAM server for failure detection, resulting in the development of the
second version (V.1) of the SAAM server heartbeat protocol, which we refer to as the
accelerated heartbeat protocol.

In the accelerated heartbeat protocol, the communication between the 
primary SAAM server and the backup SAAM server is partitioned into successive time 
periods. At the end of each period, the backup SAAM server actively queries the health 
of the primary SAAM server by sending a message called the heartbeat query message (“Are
you alive?”). After that, the backup SAAM server waits to receive a message from the
primary SAAM server called the heartbeat response message (“Yes, I am alive”). The
received heartbeat response message indicates that the primary SAAM server is in an 
operational state. 

As long as the primary SAAM server stays in an operational state, the
length of the period, denoted by t_max, is constant. However, the length of the next period
can vary depending on the events that have occurred in the current period, according to the
following three rules:

1. The backup SAAM server sends a heartbeat query message to the
primary SAAM server and receives a heartbeat response message
within the first half of the current period. In this case, the backup
SAAM server makes the length of the next period t_max (irrespective of
the length of the current period).

2. The backup SAAM server sends a heartbeat query message to the
primary SAAM server, but does not receive a heartbeat response
message within the first half of the current period. In this case, the
backup SAAM server immediately sends another heartbeat query
message, and reduces the length of the next period by half.

3. If the length of the next period ever becomes less than a specified value
t_min, which is an upper bound on the round-trip network delay between
the backup SAAM server and the primary SAAM server, then
the backup SAAM server declares the failure of the primary SAAM
server.
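The three rules can be expressed as a single update step applied at the end of each period. The Java sketch below is an interpretation of the rules as written, not the prototype from the appendices: the class AcceleratedHeartbeat, its method names, and the choice to test the halved period against t_min before adopting it are assumptions made for illustration.

```java
// Hypothetical sketch of the period-update rules; names are illustrative.
final class AcceleratedHeartbeat {
    private final double tMax; // period length while the primary responds
    private final double tMin; // upper bound on the round-trip network delay
    private double period;

    AcceleratedHeartbeat(double tMax, double tMin) {
        if (tMin <= 0 || tMin > tMax) {
            throw new IllegalArgumentException("require 0 < tMin <= tMax");
        }
        this.tMax = tMax;
        this.tMin = tMin;
        this.period = tMax;
    }

    double currentPeriod() {
        return period;
    }

    // Applies the rules for one period. Returns true when the primary
    // server is declared failed.
    boolean endOfPeriod(boolean responseInFirstHalf) {
        if (responseInFirstHalf) {
            period = tMax;          // rule 1: restore the full period
            return false;
        }
        double next = period / 2.0; // rule 2: halve the next period
        if (next < tMin) {
            return true;            // rule 3: next period would be shorter than tMin
        }
        period = next;
        return false;
    }
}
```

For example, with tMax = 8 and tMin = 1.5 (arbitrary values), two consecutive misses shrink the period to 4 and then 2, and a third miss would require a period of 1, which is below tMin, so the failure is declared on the third miss.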

If we assume that t_min is set to allow three consecutive heartbeat response
misses, then the operation of the accelerated heartbeat protocol would be as shown in
Figure 5.4. After the first heartbeat query message, the backup server has received the
heartbeat response message. Thus, it is certain that the primary server was alive at time
(t_0 + (t_min / 2)). Additionally, since the backup server did not receive a heartbeat response
message for the heartbeat query message sent at t_1, the failure of the primary server must
have happened before the second heartbeat query message. Consequently, the actual
failure of the primary server must have happened at some time between (t_0 + (t_min / 2))
and t_1, which is the shaded area in Figure 5.4.





Figure 5.4. Detection Delay in the Accelerated Heartbeat Protocol. 



After three consecutive heartbeat message misses, the failure of the
primary server is detected (see Figure 5.4) at time t_2. The maximum detection delay time
and the minimum detection delay time are shown with the arrows marked by d_max and
d_min, respectively. Therefore, the relationship between the detection delay, d, and the
heartbeat query message interval, t_max, is given as follows:



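$$\left(\frac{t_{max}}{2} + \frac{t_{max}}{4} + \frac{t_{max}}{8} + \frac{t_{max}}{16}\right) < d < \left(t_{max} - \frac{t_{min}}{2}\right) + \left(\frac{t_{max}}{2} + \frac{t_{max}}{4} + \frac{t_{max}}{8} + \frac{t_{max}}{16}\right) \tag{5.2}$$

$$\frac{15\,t_{max}}{16} < d < \frac{31\,t_{max} - 8\,t_{min}}{16} \tag{5.3}$$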



c. Prototyping of the Heartbeat Protocols 

The prototypes explained in this section are implemented with the purpose 
of making a quick performance comparison of the two heartbeat protocols without 
integrating them with the existing SAAM server source code. Their source code can be
found in Appendices A and B.

(1) Constant heartbeat protocol (V.0) prototype: The constant
heartbeat protocol prototype includes the following Java class files:

• PrimaryServer class

• PrimaryServerThread class

• BackupServer class

• TimerHandler class

The PrimaryServer and the PrimaryServerThread
classes are used for implementing the primary server portion of the constant heartbeat
protocol. On the other hand, the BackupServer and the TimerHandler classes are
used for implementing the backup server portion of the protocol.

The PrimaryServer class provides a Graphical User Interface
(GUI) to the user (see Figure 5.5). The main components of the GUI are numbered one
through five. Specifications and functionalities of these GUI components are summarized
in Table 5.1.







Figure 5.5. The Primary Server GUI of the Constant Heartbeat Protocol. 







| Number | Name (source code) | Java Object | Functionality |
|---|---|---|---|
| 1 | display | TextArea | Displays the program instructions and the program outputs to the user. |
| 2 | startButton | Button | Initiates the heartbeat message sending process. |
| 3 | stopButton | Button | Stops the heartbeat message sending, in order to simulate the failure of the primary server. |
| 4 | intervalChoice | Choice | Selects the interval time (in seconds) between the heartbeat messages (options are 0.5, 1, 1.5, 2, 2.5, and 3). |
| 5 | exitButton | Button | Terminates the program. |



Table 5.1. Specifications of the Primary Server GUI Components. 



When the user presses the startButton, the PrimaryServer
class creates a new thread called PrimaryServerThread, which periodically sends
the heartbeat messages to the backup server. Unless the stopButton is pressed,
PrimaryServerThread keeps looping inside its run method. In every iteration of the
run method, the PrimaryServerThread first constructs the heartbeat message, then
sends the heartbeat message, and finally sleeps for an interval period. As the heartbeat
messages are transmitted to the backup server, the messages sent and the times of their
departure are displayed in the GUI (see Figure 5.6).
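The construct, send, and sleep cycle described above might look like the following Java sketch. It is not the PrimaryServerThread source from Appendix A: the class HeartbeatSender, the messageBuilder and transport callbacks (stand-ins for message construction and the socket write), and the stop method are hypothetical names introduced for illustration.

```java
import java.util.function.Consumer;
import java.util.function.IntFunction;

// Hypothetical sketch of a sender loop; not the thesis prototype.
final class HeartbeatSender implements Runnable {
    private final long intervalMillis;
    private final IntFunction<String> messageBuilder; // builds message number n
    private final Consumer<String> transport;         // stand-in for the network send
    private volatile boolean running = true;          // cleared when the user stops sending

    HeartbeatSender(long intervalMillis, IntFunction<String> messageBuilder,
                    Consumer<String> transport) {
        this.intervalMillis = intervalMillis;
        this.messageBuilder = messageBuilder;
        this.transport = transport;
    }

    // Plays the role of the stop button: the loop ends after the current sleep.
    void stop() {
        running = false;
    }

    @Override
    public void run() {
        int number = 1;
        while (running) {
            String message = messageBuilder.apply(number++); // 1. construct
            transport.accept(message);                       // 2. send
            try {
                Thread.sleep(intervalMillis);                // 3. sleep for one interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();          // preserve the interrupt and exit
                return;
            }
        }
    }
}
```

A caller would typically start it with `new Thread(sender).start()` and later call `sender.stop()`. The volatile flag lets another thread end the loop without interrupting an in-progress send.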

The heartbeat message transmitted by the primary server is
implemented as a text string, and consists of four sub-strings. The first sub-string is the
string representation of the interval value selected by the user using the intervalChoice
menu. The second sub-string is the tilde character (~) used as a delimiter. The third
sub-string is the “I am alive” string. The fourth sub-string is the text representation of the
heartbeat message number. The complete heartbeat message strings are shown in quotes
in Figure 5.6.
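As an illustration of this message layout, the following Java sketch encodes and decodes such a string. The class HeartbeatMessage and its methods are hypothetical, and the exact spacing and the “#” prefix before the message number are assumptions based on the strings displayed in Figure 5.6 rather than on the prototype source.

```java
// Hypothetical encoder/decoder for the four-part heartbeat string.
final class HeartbeatMessage {
    static final String DELIMITER = "~";       // second sub-string
    static final String TEXT = "I AM ALIVE";   // third sub-string

    final double intervalSeconds; // first sub-string
    final int number;             // fourth sub-string

    HeartbeatMessage(double intervalSeconds, int number) {
        this.intervalSeconds = intervalSeconds;
        this.number = number;
    }

    // Produces a string such as "1.0~I AM ALIVE #2" (layout assumed from Figure 5.6).
    String encode() {
        return intervalSeconds + DELIMITER + TEXT + " #" + number;
    }

    // Recovers the interval and the message number from an encoded string.
    static HeartbeatMessage decode(String raw) {
        int tilde = raw.indexOf(DELIMITER);
        int hash = raw.lastIndexOf('#');
        if (tilde < 0 || hash < tilde) {
            throw new IllegalArgumentException("Malformed heartbeat message: " + raw);
        }
        double interval = Double.parseDouble(raw.substring(0, tilde).trim());
        int number = Integer.parseInt(raw.substring(hash + 1).trim());
        return new HeartbeatMessage(interval, number);
    }
}
```

Carrying the interval inside each message means the receiver can learn the sender's current rate from any single heartbeat, without separate configuration.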



[Figure 5.6 screenshot (image not reproduced), text content only: window “Constant Heartbeat Protocol (V.0)”, panel “PRIMARY SERVER”. The display lists one line per heartbeat message, from
Sending “1.0~I AM ALIVE #2” at: 951278790040
through
Sending “1.0~I AM ALIVE #21” at: 951278809760
Controls: START, STOP, SELECT TIME INTERVAL FOR HEARTBEAT MESSAGES (sec.), EXIT.]



Figure 5.6. The Primary Server GUI while Sending Heartbeat Messages. 



The BackupServer class provides a GUI similar to that of the 
PrimaryServer class (see Figure 5.7). The main components of the GUI are 
numbered one through four. Specifications and functionalities of these GUI components 
are summarized in Table 5.2. 






Figure 5.7. The Backup Server GUI of the Constant Heartbeat Protocol. 



| Number | Name (source code) | Java Object | Functionality |
|--------|--------------------|-------------|---------------|
| 1 | display | TextArea | Displays the program outputs to the user. |
| 2 | statusDisplayLabel | Label | Shows the primary server’s status. If the primary server is running, then its color is green and “NORMAL” is written on it. If the primary server is down, then its color is red and “PRIMARY SERVER IS DOWN” is written on it. |
| 3 | allowedMissChoice | Choice | Selects the allowed-miss value (options are 1, 2, 3, and 4). |
| 4 | exitButton | Button | Terminates the program. |



Table 5.2. Specifications of the Backup Server GUI Components. 








The backup server listens for the heartbeat messages coming from 
the primary server via a Java ServerSocket. Whenever the primary server makes a 
connection with the backup server to send a heartbeat message, the backup server creates 
a new Java Socket and retrieves the heartbeat message from the input stream. Since the 
interval between the heartbeat messages is selected on the primary server side, 
the backup server does not know this value in advance. However, when the backup server receives 
the first message, the message is tokenized and the heartbeat interval value is retrieved 
from the message string. The backup server uses this interval value when performing 
time checks on heartbeat messages. 

In order to perform time checks on heartbeat messages, the backup 
server has a Timer object called timer. After the first message is received, the timer 
is started with an initial delay of ((allowedMiss + 1) × intervalValue) seconds. For 
example, if the allowed-miss value is three and the time interval between heartbeat 
messages is one second, then the initial delay of the timer is four seconds. 
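The delay calculation can be written out as a small helper. The class and method names below are hypothetical, and returning the result in milliseconds is an assumption made for convenience.

```java
// Hypothetical helper showing the (allowedMiss + 1) * interval rule.
final class FailureTimeout {
    // Returns the timer's initial delay in milliseconds.
    static long initialDelayMillis(int allowedMiss, double intervalSeconds) {
        return Math.round((allowedMiss + 1) * intervalSeconds * 1000.0);
    }

    public static void main(String[] args) {
        // Allowed-miss of three with a one-second interval gives four seconds.
        System.out.println(initialDelayMillis(3, 1.0)); // prints 4000
    }
}
```

The extra interval in (allowedMiss + 1) means the backup server waits through the allowed number of missed messages plus one more full interval before the timer can expire.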

Whenever the backup server receives a new heartbeat message, 
it restarts the timer with its initial delay. Thus, as long as the 
primary server continues to send heartbeat messages, the timer never expires. However, 
if the primary server stops sending heartbeat messages, the timer eventually expires, 
indicating the failure of the primary server. When the timer expires, a Java action event 
is generated. This action event is received and handled by the TimerHandler class: the 
event causes the actionPerformed() method of the 
TimerHandler class to execute. The failure of the primary server is declared in 
this actionPerformed() method by calling the setPrimaryServerStatus() 
method of the BackupServer class. 
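The restart-on-receipt behavior can be illustrated without a GUI timer. The sketch below is not the Timer and TimerHandler arrangement described above; it substitutes explicit timestamps for the event-driven timer so that the logic can be checked deterministically. The class name HeartbeatWatchdog and its methods are hypothetical.

```java
// Hypothetical stand-in for the timer logic, driven by caller-supplied timestamps.
final class HeartbeatWatchdog {
    private final long timeoutMillis;   // the timer's initial delay
    private long deadlineMillis = -1;   // -1 until the first heartbeat arrives

    HeartbeatWatchdog(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called for every received heartbeat: (re)starts the countdown.
    void onHeartbeat(long nowMillis) {
        deadlineMillis = nowMillis + timeoutMillis;
    }

    // True once the countdown has run out, i.e. the primary is considered failed.
    boolean isExpired(long nowMillis) {
        return deadlineMillis >= 0 && nowMillis >= deadlineMillis;
    }
}
```

With a timeout of 4000 ms, a heartbeat at time 1000 moves the deadline to 5000; a further heartbeat at 4500 moves it to 8500, so a steady stream of messages keeps pushing the deadline out and the watchdog never expires.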

After the failure of the primary server is detected, the failure 
detection time, the receive time of the last message, and elapsed time (difference between 
the receive time of the last message and the failure detection time) are displayed, as 
shown in Figure 5.8. 



[Figure 5.8 screenshot: "Constant Heartbeat Protocol (V.0)" window titled BACKUP SERVER. The text area lists entries of the form Received... "I AM ALIVE #n" Interval: 0.5 at: <timestamp> for messages #16 through #29, followed by "MAIN SERVER FAILED !!" AT: 949282152030, Failure detected at: 949282152030, Last message received at: 949282149990, and Elapsed time for detection is: 2040 milliseconds. Below the text area are the STATUS label, the "ALLOWED NUMBER OF MISSES BEFORE FAILURE" menu, and the EXIT button.] 



Figure 5.8. The Backup Server GUI after the Failure of the Primary Server. 






(2) Accelerated heartbeat protocol (V.1) prototype: The 
following Java classes are used in the implementation of the accelerated heartbeat 
protocol: 

• PrimaryServer class 

• PrimaryServerThread class 

• BackupServer class 

• MessageTimerHandler class 

• AckReceiveThread class 

• AckTimerHandler class 

The PrimaryServer and the PrimaryServerThread 
classes implement the primary server portion of the protocol. The 
BackupServer, the MessageTimerHandler, the AckReceiveThread, and the 
AckTimerHandler classes implement the backup server portion of the 
protocol. 

The PrimaryServer class provides a GUI (see Figure 5.9) to 
the user. The main components of the GUI are numbered one through three. 
Specifications and functionalities of these GUI components are summarized in Table 5.3. 

The main responsibility of the PrimaryServer class is to listen 
for the heartbeat query messages coming from the backup server. The PrimaryServer class 
listens for these messages via a Java ServerSocket. Whenever the backup server makes a 
connection to send a heartbeat query message, the primary server 
creates a new Java Socket and a new thread, called PrimaryServerThread, to 
handle the connection. The PrimaryServerThread class is responsible for retrieving 
the message string from the socket and sending the corresponding heartbeat response 
message. Whenever the PrimaryServerThread receives a heartbeat query 
message, it displays the message arrival time on the GUI. After that, the 
PrimaryServerThread sends the heartbeat response message, which 
is a string with the text "YES, I AM ALIVE". After the heartbeat 
response message is sent, PrimaryServerThread displays the time at which the message 
was sent in the text area (see Figure 5.9). 
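A minimal sketch of the per-connection handling is given below. It works on a generic reader and writer rather than a real Socket, so it is only an approximation of the thread described above; in a socket-based version the streams would come from the socket's getInputStream() and getOutputStream(). The class name QueryResponder, the responding flag (standing in for the toggle button of Figure 5.9), and the use of one text line per query are assumptions. The log strings are taken from Figure 5.9.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;

// Hypothetical sketch of the query/response step; names are illustrative.
final class QueryResponder {
    static final String RESPONSE = "YES, I AM ALIVE";
    private volatile boolean responding = true; // false while responses are switched off

    void setResponding(boolean responding) {
        this.responding = responding;
    }

    // Reads one query line and, if enabled, writes the response.
    // Returns the line that would be shown in the text area.
    String handle(BufferedReader in, PrintWriter out, long nowMillis) {
        try {
            String query = in.readLine();
            if (query == null) {
                return "NO QUERY RECEIVED"; // assumed wording, not from the thesis
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!responding) {
            return "RESPONSE DID NOT SEND";
        }
        out.println(RESPONSE);
        out.flush();
        return "Sent response at: " + nowMillis;
    }
}
```

Withholding the response while still reading the query makes it possible to simulate a primary server failure without terminating the program.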




[Figure 5.9 screenshot: Primary Server window of the accelerated heartbeat protocol. The text area pairs each "Received Query at: <timestamp>" entry with a "Sent response at: <timestamp>" entry for seventeen queries from 950037811650 through 950037828350, spaced roughly 1040 apart. The last three queries, received at 950037829390, 950037829940, and 950037830220, are each marked "RESPONSE DID NOT SEND". Below the text area are the toggle button, labeled "STOP SENDING RESPONSES", and the EXIT button.] 




Figure 5.9. The Primary Server GUI of the Accelerated Heartbeat Protocol. 









| Number | Name (source code) | Java Object | Functionality |
|--------|--------------------|-------------|---------------|
| 1 | display | TextArea | Displays the program outputs to the user. |
| 2 | toggleButton | JToggleButton | Stops heartbeat response message transmission with the first click and restarts the heartbeat response message transmission with the second click. |
| 3 | exitButton | JButton | Terminates the program. |



Table 5.3. Specifications of the Primary Server GUI Components. 



The BackupServer class provides a GUI for the backup server 
portion of the accelerated protocol (see Figure 5.10). The main components of the GUI 
are numbered one through six. Specifications and functionalities of these GUI 
components are summarized in Table 5.4. 




Figure 5.10. The Backup Server GUI of the Accelerated Heartbeat Protocol. 








| Number | Name (source code) | Java Object | Functionality |
|--------|--------------------|-------------|---------------|
| 1 | display | TextArea | Displays the program outputs to the user. |
| 2 | rtdButton | JButton | Calculates the round trip delay between the two servers by sending four heartbeat queries. |
| 3 | startButton | JButton | Starts sending heartbeat query messages. |
| 4 | tmaxComboBox | JComboBox | Selects the tmax value, which is the maximum interval time between heartbeat query messages (options are 0.5, 1, 1.5, 2, 2.5, 3, 3.5, and 4). |
| 5 | roundTripDelayComboBox | JComboBox | Selects the tmin value, which is the round trip delay upper bound between the two servers (Options are 0.05, 0.06, 0.30) |
| 6 | exitButton | JButton | Terminates the program. |



Table 5.4. Specifications of the Backup Server GUI Components. 



The main responsibility of the BackupServer class is to send 
the heartbeat query messages to the primary server and to implement the accelerated 
heartbeat protocol rules explained in Section A.1.b. The heartbeat query messages are 
sent from the BackupServer class; the heartbeat response messages, however, are 
received by the AckReceiveThread class. When the start button is pressed, the 
BackupServer class creates two Timer objects called messageSendTimer and 
ackReceiveTimer. 

The purpose of the messageSendTimer is to provide periodic 
heartbeat query message transmission. The messageSendTimer is a repeating timer 
and always restarts with its initial delay, which is set to the tmax value. Whenever the 
messageSendTimer expires, the actionPerformed() method of the 
MessageTimerHandler class is executed. This actionPerformed() method calls 
the sendMessage() method of the BackupServer class, and the 
backup server sends a new heartbeat query message to the primary server. For example, if 
tmax is equal to two, then the backup server sends a new heartbeat 
query message every two seconds. 

The purpose of the ackReceiveTimer is to implement time 
checks on the heartbeat response messages. The ackReceiveTimer is started when 
the first heartbeat query message is sent. Its value is set to half of the current interval. If 
the AckReceiveThread receives the heartbeat response message prior to the 
expiration of the ackReceiveTimer, then the ackReceiveTimer is stopped before 
its expiration. Therefore, as long as the heartbeat response messages arrive regularly 
within the first half of the current interval, the ackReceiveTimer never expires. 

However, if the backup server does not receive a heartbeat 
response message within the first half of the current interval, then the 
ackReceiveTimer expires and causes the execution of the actionPerformed() 
method of the AckTimerHandler class. In this method, the messageSendTimer is 
stopped, a new heartbeat query message is sent, and the current interval value is set to 
half of the previous interval value. If the new interval value ever becomes less than the 
round trip delay upper bound value, then failure of the primary server is declared (see 
Figure 5.11). However, if the backup server receives a heartbeat response message before 
the failure is declared, then the messageSendTimer is restarted and 
the current interval is set back to its original value, tmax. 
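The halving rule can be traced with a short calculation. The sketch below is an idealized model rather than the thesis implementation: it ignores processing and network overhead, assumes the primary server never answers after the last unanswered query, and uses hypothetical names (AcceleratedHeartbeat, expiryTimes). Times are in milliseconds measured from the first unanswered query.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the interval-halving rule; names are illustrative.
final class AcceleratedHeartbeat {
    // Returns the times at which the acknowledgment timer expires.
    // The last entry is the moment failure is declared.
    static List<Long> expiryTimes(long tmaxMillis, long tminMillis) {
        if (tmaxMillis <= 0 || tminMillis <= 0) {
            throw new IllegalArgumentException("tmax and tmin must be positive");
        }
        List<Long> times = new ArrayList<>();
        long interval = tmaxMillis;
        long elapsed = 0;
        while (true) {
            elapsed += interval / 2;   // timer runs for half the current interval
            times.add(elapsed);
            interval = interval / 2;   // current interval is halved
            if (interval < tminMillis) {
                return times;          // below the round trip bound: declare failure
            }
            // Otherwise a new query is sent and the timer runs again.
        }
    }

    static long detectionDelayMillis(long tmaxMillis, long tminMillis) {
        List<Long> times = expiryTimes(tmaxMillis, tminMillis);
        return times.get(times.size() - 1);
    }

    public static void main(String[] args) {
        System.out.println(expiryTimes(2000, 300)); // prints [1000, 1500, 1750]
    }
}
```

For a tmax of two seconds and an assumed round trip bound of 0.30 seconds, the model gives expiries 1000, 1500, and 1750 ms after the unanswered query. The log in Figure 5.11 shows the corresponding events 1050, 1590, and 1870 ms after the last query, which is in line with this pattern once overhead is added.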






[Figure 5.11 screenshot: "Accelerated Heartbeat Protocol (V.1)" window titled BACKUP SERVER. The text area shows queries sent about every 2030 ms, from 949304040660 through 949304050820, each answered about 110 ms later. The query sent at 949304052850 receives no response; further unanswered queries follow at 949304053900 (diff 1050) and 949304054440 (diff 540), and a final "Action performed" entry appears at 949304054720. The log ends with Last message send at: 949304052850, Failure detected at: 949304054720, Elapsed Time: 1870, and "SERVER is down". Below the text area are the line "Max of four Round-Trip-Delay is: 220 ms.", the tmax and round trip delay upper bound menus, and the EXIT button.] 



Figure 5.11. The Backup Server GUI After Failure. 



d. Performance Comparison of the Heartbeat Protocols 

In this section, the failure detection delays and other performance results 
of the two heartbeat protocols are compared. In order to exclude unpredictable network 
delays, the tests are performed by running both the primary server and the backup server 
programs on the same computer. (The processor of the test computer was a 233 MHz 
Pentium II.) 

First, the performance of the constant heartbeat protocol is evaluated. A 
set of tests is performed using different interval values between the heartbeat messages. 
In these tests, the allowed-miss values of one, two, three, and four are tested and the failure 
detection delays are calculated while keeping the allowed-miss value constant (the results are 



